This wiki contains documents and discussions about Gen2Phen's efforts to produce exchange formats and consistent interfaces for LSDBs.
Standardising access to LSDB data is one of the core aims of Gen2Phen. Improved abilities to query, visualise and federate data can be achieved by having well-defined interfaces and consistent exchange formats. See the Use Cases wiki page for our consideration of what makes these issues important.
Note to self: add links to various projects & slides we've been passing around in the initial E-mail discussion.
This wiki page is the starting point for data-standards-related activities in Gen2Phen.
See the current XML format: http://www.gen2phen.org/wiki/lsdb-xml-data-format and workshop notes: http://www.gen2phen.org/post/lsdb-minimal-requirements-johan-discussion-gen2phen-community-20-jan-2009-helsinki
See also the other LSDB/G2P checklist.
This list is a mapping between the LSDB minimal data requirements as listed in deliverable 3.4, LOVD 2.0/3.0 and the XML format currently under development (see link above). As the XML format will be developed further, this list will be updated.
Last update: 2010-01-21.
| Name | D3.4 says | Availability in LOVD 2.0 | Availability in LOVD 3.0 (planned) | Cafe Rouge | Element or attribute currently in XML format |
| Variant/Exon | Recommended | Standard, may be removed | Standard, may be removed | Recommended | variant/exon |
| Variant/DNA_genomic | Obligatory | Not available, but genomic position can automatically be generated | Always available (generated) | (Obligatory)* | variant/name and scheme attribute = 'HGVS' and variant/type='DNA' |
| Variant/DNA_coding | Recommended | Always available | Always available | (Obligatory)* | variant/name and scheme='HGVS' and |
| Variant/RNA | Obligatory | Always available | Always available | Recommended (might not be available) | variant/seq_change/variant/name |
| Variant/Protein | Obligatory | Always available | Always available | Recommended (might not be available) | variant/seq_change/variant/name |
| Variant/DBID | Obligatory | Always available | Always available | Should be generated by LSDB | variant/id |
| Variant/Reference | Obligatory | Not standard LOVD; can be derived from Patient/Reference | Undecided | Obligatory | variant/ref_seq |
| Variant/DNA_published | Recommended | Sometimes available | Sometimes available | Recommended | variant/aliases/variant/name |
| Variant/Detection/Template | Obligatory | Not standard LOVD; can be derived from Patient/Detection/Template | Always available; as Screening/Template | Obligatory | variant/variant_detection/template |
| Variant/Detection/Technique | Obligatory | Not standard LOVD; can be derived from Patient/Detection/Technique | Always available; as Screening/Technique | Obligatory | variant/variant_detection/technique |
| Variant/DNA_remark | Recommended | Compatible with Variant/Remarks which is sometimes available | Sometimes available; as Variant/Remarks | Recommended | variant/comment |
| Variant/Frequency | Recommended | Standard, may be removed | Standard, may be removed | Recommended | variant/frequency |
| Variant/Origin | Recommended | Not standard LOVD, can only partially be derived from other (optional) columns | Undecided | Recommended | variant/origin |
| Variant/Restriction_site | Optional | Standard, may be removed | Standard, may be removed | Optional | variant/restriction_site |
| Variant/Allele | Recommended | Always available | Undecided | Recommended | variant/genetic_origin/source |
| Variant/Pathogenicity | Recommended | Always available | Undecided | Recommended | variant/pathogenicity |
| Patient/Patient_ID | Obligatory | Non-public information | Non-public information | Obligatory | patient/original_id |
| Patient/Phenotype/Disease | Obligatory | Always available | Always available | Obligatory | patient/phenotype |
| Patient/Remarks | Recommended | Standard, may be removed | Standard, may be removed | Recommended | patient/comment |
| Patient/Origin/Geographic | Recommended | Sometimes available | Sometimes available | Recommended | patient/population (type="region") |
| Patient/Origin/Ethnic | Recommended | Sometimes available | Sometimes available | Recommended | patient/population (type="ethnic") |
| Patient/Gender | Recommended | Sometimes available | Sometimes available | Recommended | patient/gender |
| ID_submitterid | Obligatory | Sometimes available (field can be empty, which means the curator is the submitter) | Always available | Obligatory | source/id |
| Variant/HGNC gene Symbol | Obligatory | Obligatory | variant/gene/accession and source="HGNC" | ||
| Variant/Sharing policy (public/private) | Obligatory | variant/sharing_policy (since release 1.4) | |||
| Variant/Use permission (default Creative Commons 0) | Obligatory | variant/use_permission (since release 1.4) | |||
| Legend | |||||
| Always available: Needs modification of LOVD to allow removal | |||||
| Standard, may be removed: Is enabled by default but users are allowed to remove these columns | |||||
| Sometimes available: Is available in LOVD but not enabled by default; users can activate these columns | |||||
| Not standard LOVD: Some LOVDs (especially Leiden-based) have these columns | |||||
| Undecided: May be same or similar as in LOVD 2.0, but we haven't decided on the exact implementation yet. | |||||
| (Obligatory)*: One of these fields needs to be present. | |||||
This post discusses web service interfaces to Locus Specific Databases (LSDBs), from the specific point of view of visualising the data. The aim is to make our experiences of this at NGRL clear and to suggest, on this basis, why web services are desirable and in the broadest sense what they should do to satisfy the "variant browser" use case.
For visualisation of LSDB data, one of the major advantages of web service interfaces to LSDBs would be dynamic retrieval of variant data. This would ensure that the data could always remain up-to-date in the browser. In the NGRL Universal Browser, for instance, local copies are made for all external variant databases. This is simply because it is very rare to find a machine-to-machine interface to access such data. The only database that is accessed live is NGRL's own Diagnostic Mutation Database (DMuDB). Live access provides significant advantages in maintainability, particularly with regard to keeping data up-to-date and avoiding complex, long-winded import procedures.
Standard web service interfaces would also allow users to more easily visualise LSDB data without requiring that the browser software have any prior knowledge of the LSDB in question. With appropriate service discovery mechanisms and standardised interfaces any compliant LSDB could essentially be plugged in to any compliant visualisation tool. The standardised interfaces and exchange formats that would come with web services would also help avoid the potentially arduous task that currently presents itself to the developers of visualisation tools when seeking to integrate new LSDBs.
The problems include the following:
These are nearly all problems which are being addressed across Gen2Phen in general. Many of these problems are addressed by the LSDB-in-a-box approach, standardisation and LRGs. However, these do not entirely solve the problem of automatically integrating this data into a browser or automatically keeping it up-to-date.
From the point of view of our relatively simple scenario the operations we would require are along the lines of:
getSummary - IN none - OUT an LSDBSummary
getAllVariants - IN none - OUT a VariantList
getVariants - IN a Query - OUT a VariantList
getVariantByID - IN an ID string - OUT a Variant
For these operations, it is the output types in particular which would depend upon an interchange format. A good deal of work has already been done on the XML interchange format since this text was originally written, so brief notes regarding datatypes are only presented here to clarify the scenario we envision. The real job of building web service interfaces would involve taking the types defined in the interchange format as the starting point.
Query - would allow bounds to be specified on region (relative to a reference sequence) and time of update, and maybe also bounds on a region specified in HGVS numbering? This need not depend on the interchange format.
LSDBSummary - would present summary-level information about the LSDB, e.g. name, version, url, link-out-url, creation date, last updated, total entries, number of unique variants, Contact, Gene/s (the latter two would probably be ComplexTypes). A Gene would, for instance, include the HGNC symbol and ID, the Entrez Gene ID, the MIM number and the reference sequence used. Personally I think we shouldn't limit the reference sequence to just be an LRG, although they make the job much easier. I think we have to be able to work with legacy data. The LSDBSummary would probably be (partly?) dependent on elements of the interchange format.
VariantList = a sequence of Variant objects
Variant = ??? This is the type for which I would really wait for the exchange format, but it could closely follow the existing PAGE model, the minimal core information suggested by Johan, etc. We don't bring strong requirements in this regard as we are used to working with existing LSDBs. Some reference sequence information and an HGVS name would suffice for us. We are mainly interested in genotype information, and potentially some very "shallow" phenotype information. That is not to say that the format/services should not be designed with a much more generic usage in mind.
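The operations and datatypes above could be sketched roughly as follows. This is only an illustration under our own assumptions: the Python names (LSDBService, InMemoryLSDB, the field names) are hypothetical, only a few of the suggested summary fields are shown, positional bounds on the Query are left out, and the real types would be taken from the interchange format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Protocol


@dataclass
class Gene:
    hgnc_symbol: str
    ref_seq: str  # need not be an LRG; legacy reference sequences are allowed


@dataclass
class LSDBSummary:
    name: str
    url: str
    last_updated: datetime
    genes: List[Gene] = field(default_factory=list)


@dataclass
class Variant:
    id: str
    ref_seq: str
    hgvs_name: str
    updated: datetime


@dataclass
class Query:
    # Every bound is optional; None means "no restriction".
    ref_seq: Optional[str] = None
    updated_since: Optional[datetime] = None


class LSDBService(Protocol):
    """The four operations listed above, as a structural interface."""

    def get_summary(self) -> LSDBSummary: ...
    def get_all_variants(self) -> List[Variant]: ...
    def get_variants(self, query: Query) -> List[Variant]: ...
    def get_variant_by_id(self, variant_id: str) -> Variant: ...


class InMemoryLSDB:
    """Toy implementation, used only to exercise the interface."""

    def __init__(self, summary: LSDBSummary, variants: List[Variant]):
        self._summary = summary
        self._variants = list(variants)

    def get_summary(self) -> LSDBSummary:
        return self._summary

    def get_all_variants(self) -> List[Variant]:
        return list(self._variants)

    def get_variants(self, query: Query) -> List[Variant]:
        result = []
        for v in self._variants:
            if query.ref_seq is not None and v.ref_seq != query.ref_seq:
                continue
            if query.updated_since is not None and v.updated < query.updated_since:
                continue
            result.append(v)
        return result

    def get_variant_by_id(self, variant_id: str) -> Variant:
        for v in self._variants:
            if v.id == variant_id:
                return v
        raise KeyError(variant_id)
```

A browser written against LSDBService would not need to know which LSDB sits behind it, which is the "plug in any compliant LSDB" idea described earlier.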
For performance and dependability reasons we do not want our browser to be completely dependent on accessing data via external web services. Nor do LSDBs want to be burdened with lots of redundant calls to their services. We would therefore aim to cache LSDB data in a "lazy" fashion. Initially we would get all data from the LSDB and make a local copy. Each time a user browsed to a particular region in a gene we would then make a query for any variants in this region that had been added or updated since the last date of our cached version. If there were any changes we would update our cache. Either way we would update the timestamp on the cache.
One would probably also add further heuristics to avoid making too many calls, such as only checking for updates if the cache is more than X hours old. These differences in strategy don't have any effect on the requirements from the web service interfaces however.
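A minimal sketch of this lazy caching strategy, including the "only check if the cache is more than X hours old" heuristic, might look as follows. The class name and the two fetch callables are hypothetical stand-ins for calls to getAllVariants and getVariants. As a simplification the sketch asks for all changes since the last check rather than only those in the browsed region; restricting the update query to a region would need a timestamp per region rather than one for the whole cache.

```python
from datetime import datetime, timedelta
from typing import Callable, Dict, List, Optional


class LazyVariantCache:
    """Local copy of one LSDB's variants, refreshed lazily on access."""

    def __init__(
        self,
        fetch_all: Callable[[], List[dict]],
        fetch_updated: Callable[[datetime], List[dict]],
        min_age: timedelta = timedelta(hours=1),
        clock: Callable[[], datetime] = datetime.now,
    ):
        self._fetch_all = fetch_all          # stands in for getAllVariants
        self._fetch_updated = fetch_updated  # stands in for getVariants(updated since)
        self._min_age = min_age              # the "X hours" heuristic
        self._clock = clock                  # injectable so the logic can be tested
        self._variants: Dict[str, dict] = {}
        self._checked_at: Optional[datetime] = None

    def variants(self) -> List[dict]:
        now = self._clock()
        if self._checked_at is None:
            # First access: take a full local copy.
            for v in self._fetch_all():
                self._variants[v["id"]] = v
            self._checked_at = now
        elif now - self._checked_at >= self._min_age:
            # Cache is old enough: ask only for additions and updates.
            for v in self._fetch_updated(self._checked_at):
                self._variants[v["id"]] = v
            # Update the timestamp whether or not anything changed.
            self._checked_at = now
        return list(self._variants.values())
```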
An alternative implementation strategy, which might also be favourable for federation of LSDBs, might follow a publish-subscribe model. In this model clients (including browsers) would subscribe to the LSDB service and the LSDB would then "push" the data to them either whenever there were changes, or according to some schedule. This would have the disadvantage of requiring clients to implement call-back interfaces, which is not always feasible. It could potentially cause problems for LSDBs as well, as they might end up with large subscriber lists and not know which were valid (although they could detect whether the call-back interface existed). It also requires some strategy to ensure that clients do not accidentally miss updates. All things considered this seems a more complex option.
We also have to address the question of whether something like DAS would be a better way to address the issue.
At the National Genetics Reference Laboratory (NGRL), Manchester we are investigating the development of prototype web services to allow machine-to-machine access to LOVD data. Here, I hope to provide an overview, and record some of the issues involved as we develop the simplest of these services.
The simple service we aim to investigate first of all will provide a means of retrieving all public variants (and any associated public patient data) for a particular gene. This also implies a requirement to be able to get a list of genes supported by a particular LOVD database. For our Browser use case we are also interested in retrieving information about the reference sequence in use.
As this could be the basis of future, more fully featured services, ease of use and ease of installation are also important. Ease of use suggests a REST rather than/as well as a SOAP/WSDL interface. Ease of installation suggests that the service should be developed in PHP, as this will already be present on the target system, since LOVD is developed in PHP.
LOVD installation was relatively straightforward. The only problems encountered are well documented on the LOVD website. One of these was to do with strict settings in MySQL.
We investigated PHP frameworks hoping to find something useful for rapid development of REST and SOAP interfaces and abstraction between the database, PHP classes, XML, etc. We initially looked at WSO2, but quickly found that it was not the quick lightweight solution we needed. It essentially needs to be rebuilt from source for different platforms, PHP versions, etc. This was not easy and did not meet our requirements. In the end we found the Zend framework to be useful, and have made use of the REST Server in particular.
We found it relatively easy to integrate with the existing PHP code and MySQL tables (centring around the many-to-many relationship between patients and variants) and have started to produce services that reproduce access to the publicly visible LOVD data.
We have so far concentrated on a REST interface to retrieve various information from an LOVD instance in a simple XML format. Because the aim of the exercise is prototyping and exploration we have not supplied an XML schema or stuck to the XML interchange format that is being developed as part of Gen2Phen. A more mature LOVD web service should aim to do this however.
The service is still at an early stage, but you can see the progress (on our own LOVD instance) using the URLs below:
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getAllGenes - returns all genes available at that particular LOVD instance
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getAllVariants&hgnc_symbol=UBE3A - returns all variants for a particular gene (i.e. as reported in every patient)
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getUniqueVariants&hgnc_symbol=UBE3A - as above, but returns unique variants (i.e. one variant element, with 0-n patient sub-elements)
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getVariantById&hgnc_symbol=UBE3A&id=UBE3A_00001 - returns a single variant based upon the publicly visible database id.
The XML formats returned are more or less what was easiest to produce, and they attempt to reproduce the publicly visible information from LOVD. As you will see we have not dropped any empty optional elements in the results, and are not returning LOVD URLs yet. Below is an example instance of a variant:
<variant>
<id>UBE3A_00001</id>
<exon> 08</exon>
<dna_change>c.3_16del14</dna_change>
<rna_change/>
<protein_change>Frame shift (predicted)</protein_change>
<restriction_site/>
<frequency>-</frequency>
<patient>
<pathogenicity>
<reported>Probably pathogenic</reported>
<concluded>Probably pathogenic</concluded>
<short>+?/+?</short>
</pathogenicity>
<id>003199(MC)</id>
<disease>Angelman syndrome</disease>
<reference/>
<template>DNA</template>
<technique>SEQ</technique>
<remarks>Parents not tested - out of frame deletion so pathogenicity assumed.</remarks>
<times_reported>1</times_reported>
<variant_created>2008-12-08 16:01:02</variant_created>
<variant_edited>2009-05-01 16:32:13</variant_edited>
<patient_created>2008-12-08 16:01:02</patient_created>
<patient_edited>2009-02-04 12:40:22</patient_edited>
</patient>
</variant>
The service is quite lightweight and only requires copying the PHP to your LOVD directory (SimpleXML module is required in PHP, but this is commonly enabled anyway).
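As an illustration of how a client might consume the prototype output, the following sketch parses an abridged copy of the example variant with Python's standard library. The function name parse_variant and the shape of the returned dictionary are our own assumptions; empty optional elements such as rna_change are mapped to None.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the example variant shown above.
SAMPLE = """<variant>
  <id>UBE3A_00001</id>
  <exon> 08</exon>
  <dna_change>c.3_16del14</dna_change>
  <rna_change/>
  <patient>
    <id>003199(MC)</id>
    <disease>Angelman syndrome</disease>
    <reference/>
  </patient>
</variant>"""


def text_of(parent, tag):
    """Return the stripped text of a child element, or None if missing or empty."""
    value = (parent.findtext(tag) or "").strip()
    return value or None


def parse_variant(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "id": text_of(root, "id"),
        "exon": text_of(root, "exon"),
        "dna_change": text_of(root, "dna_change"),
        "rna_change": text_of(root, "rna_change"),
        # A unique variant may carry zero or more patient sub-elements.
        "patients": [
            {"id": text_of(p, "id"), "disease": text_of(p, "disease")}
            for p in root.findall("patient")
        ],
    }


variant = parse_variant(SAMPLE)
print(variant["id"], variant["dna_change"], len(variant["patients"]))
```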
Requiring authentication/authorisation:
Hi All
Ivo and I have been discussing the data format and have arrived at the following proposal:
- Use XML elements rather than attributes, for extensibility etc.
- Add separate elements for variation aliases and sequence changes. The latter is for related, often consequential, sequence and structural changes that are often carried with the main variation entry. Aliases are like db_xrefs but can also have info on reference sequence + other details.
<variant>
<name>c.34G>C</name>
<naming_scheme>HGVS</naming_scheme>
<ref_seq>XY000000</ref_seq>
<!-- ... then detection templates and other usual stuff in the same way -->
<aliases>
<variant>
<name>c.342G>C</name>
<naming_scheme>HGVS</naming_scheme>
<ref_seq>LRG000001</ref_seq>
</variant>
<variant>
<name>MUTXYZ</name>
<naming_scheme>FINDIS</naming_scheme>
<ref_seq>NM000001</ref_seq>
</variant>
</aliases>
<seq_change>
<variant>
<name>g.232323G>T</name> <!-- now the change is in genomic DNA -->
<ref_seq>AC0001</ref_seq>
<naming_scheme>HGVS</naming_scheme>
</variant>
<variant>
<name>A447RfsX11</name>
<naming_scheme>FINDIS</naming_scheme>
</variant>
</seq_change>
</variant>
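To make the nesting concrete, here is a small sketch of how a consumer might read the main variant, its aliases and its sequence changes from this proposed layout. The function names are hypothetical and the sample is an abridged copy of the proposal above; the format itself was still under discussion.

```python
import xml.etree.ElementTree as ET

# Abridged copy of the proposed layout.
PROPOSED = """<variant>
  <name>c.34G>C</name>
  <naming_scheme>HGVS</naming_scheme>
  <ref_seq>XY000000</ref_seq>
  <aliases>
    <variant>
      <name>c.342G>C</name>
      <naming_scheme>HGVS</naming_scheme>
      <ref_seq>LRG000001</ref_seq>
    </variant>
    <variant>
      <name>MUTXYZ</name>
      <naming_scheme>FINDIS</naming_scheme>
      <ref_seq>NM000001</ref_seq>
    </variant>
  </aliases>
  <seq_change>
    <variant>
      <name>g.232323G>T</name>
      <ref_seq>AC0001</ref_seq>
      <naming_scheme>HGVS</naming_scheme>
    </variant>
    <variant>
      <name>A447RfsX11</name>
      <naming_scheme>FINDIS</naming_scheme>
    </variant>
  </seq_change>
</variant>"""


def describe(variant):
    """(name, naming scheme, reference sequence); missing details become None."""
    return (
        variant.findtext("name"),
        variant.findtext("naming_scheme"),
        variant.findtext("ref_seq"),
    )


def parse_proposed(xml_text):
    root = ET.fromstring(xml_text)
    return {
        "main": describe(root),
        "aliases": [describe(v) for v in root.findall("aliases/variant")],
        "seq_changes": [describe(v) for v in root.findall("seq_change/variant")],
    }


parsed = parse_proposed(PROPOSED)
print(parsed["main"], len(parsed["aliases"]), len(parsed["seq_changes"]))
```

Because sub-variants reuse the variant element, the same describe helper works at every level, which is one practical benefit of the element-based layout.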
============================
Comments:
The main variant element is the reference entry people are working with. The sub-variant elements can carry the same level of detail, but this should be optional.
It is not always possible to say whether related variation information should go into the aliases or the seq_change section, for example for variations on different splice variant templates (cDNA templates). But perhaps this does not matter.
Perhaps we should also add a tag which tells whether the seq_change has been experimentally verified or not.
Implementation-specific things like global ids should go into attributes, if those are needed. (?)
<variant id="lsid://findis.org/variant/00001" />
Juha
(updated 2009-10-22; includes gene listing and some updates to the output format)
(updated 2009-12-27; described new additions and genomic locations)
(updated 2010-02-15; described new additions and world-wide LOVD querying service)
(updated 2010-04-20; changed all URLs to use rest.php instead of rest)
Please note that since I still want to add a few features to this API, its output format is not yet stable.
Since version 2.0-22, released October 5th, LOVD includes a simple webservice enabling simple queries or listing of variant data (not patient data), allowing the creation of an overall LOVD querying service. One can search on a gene symbol, get the list of available genes in the database, or on a per-gene basis, list all variants or search for a certain variant or DNA location.
The output it creates is an Atom 1.0 feed with basic variant information in plain text:
(snippet)
Genes:
<content type="text">
id:CRYAA
entrez_id:1409
symbol:CRYAA
name:Crystallin, alpha-A
chromosome_location:21q22.3
position_start:chr21:44589141
position_end:chr21:44592913
refseq_genomic:NC_000021.8
refseq_mrna:NM_000394.2
refseq_build:hg19
</content>
Variants:
<content type="text">
symbol:CRYAA
id:0000001
position_mRNA:NM_000394.2:c.27
position_genomic:chr21:44589236
Variant/DNA:c.27G>T
Variant/DBID:CRYAA_00001
</content>
The id field contains the internal ID of the variant entry. The position is read from the Variant/DNA field and interpreted by a Mutalyzer module, if possible. If the variant cannot be interpreted by Mutalyzer, LOVD tries to isolate the position from the Variant/DNA field by itself. The genomic location will only be available if the gene has been configured properly (i.e. has proper reference sequence information associated), and Mutalyzer could interpret the variant correctly. The Variant/DBID is the field which is actually used to link back to LOVD, since it is shared by other variant entries with the same change on DNA level. The link to LOVD is included in Atom's <link> element of the entry.
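Since the content is plain text with one key:value pair per line, a client could read it along the following lines. This is a sketch with a hypothetical function name; the one point worth noting is that values such as chr21:44589236 contain colons themselves, so each line should be split on the first colon only.

```python
def parse_content(text):
    """Turn the key:value lines of an entry's content into a dictionary."""
    record = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # partition() splits on the first colon only, so values keep theirs.
        key, sep, value = line.partition(":")
        if not sep:
            continue  # not a key:value line; skip it
        record[key] = value
    return record


VARIANT_CONTENT = """
symbol:CRYAA
id:0000001
position_mRNA:NM_000394.2:c.27
position_genomic:chr21:44589236
Variant/DNA:c.27G>T
Variant/DBID:CRYAA_00001
"""

record = parse_content(VARIANT_CONTENT)
print(record["position_genomic"])
```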
The full output of a query returning one variant is:
<?xml version="1.0" encoding="ISO-8859-1"?>
<feed xmlns="http://www.w3.org/2005/Atom">
<title>
Results for your query of the CRYAA gene database
</title>
<link rel="alternate" type="text/html" href="http://chromium.liacs.nl/LOVDv.2.0-dev/"/>
<link rel="self" type="application/atom+xml" href="http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA"/>
<updated>2007-06-21T17:23:00+02:00</updated>
<id>tag:chromium.liacs.nl,2006-11-21:Chr:LOVDv.2.0-dev/REST_api</id>
<generator uri="http://www.LOVD.nl/" version="2.0-22">
Leiden Open Variation Database
</generator>
<rights>Copyright (c), the curators of this database</rights>
<entry xmlns="http://www.w3.org/2005/Atom">
<title>CRYAA:c.27G>T</title>
<link rel="alternate" type="text/html" href="http://chromium.liacs.nl/LOVDv.2.0-dev/variants.php?select_db=CRYAA&amp;action=search_unique&amp;search_Variant%2FDBID=CRYAA_00001"/>
<link rel="self" type="application/atom+xml" href="http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA/0000001"/>
<id>tag:chromium.liacs.nl,1970-01-01:CRYAA/0000001</id>
<author>
<name>Unknown</name>
</author>
<published>1970-01-01T00:00:00+01:00</published>
<updated>1970-01-01T00:00:00+01:00</updated>
<content type="text">
symbol:CRYAA
id:0000001
position_mRNA:NM_000394.2:c.27
position_genomic:chr21:44589236
Variant/DNA:c.27G>T
Variant/DBID:CRYAA_00001
</content>
</entry>
</feed>
Note that the actual variant content is currently in plain text format. Once the XML export format(s) are agreed on, I will implement those as well.
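For readers who want to consume the feed, the sketch below pulls the title, the two links and the raw content out of each entry using Python's standard library. The function name is hypothetical and the sample is an abridged copy of the feed above, with the XML declaration left out and the ampersands in the link written as XML entities.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"  # Atom 1.0 namespace, in ElementTree notation

FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Results for your query of the CRYAA gene database</title>
  <entry xmlns="http://www.w3.org/2005/Atom">
    <title>CRYAA:c.27G>T</title>
    <link rel="alternate" type="text/html" href="http://chromium.liacs.nl/LOVDv.2.0-dev/variants.php?select_db=CRYAA&amp;action=search_unique&amp;search_Variant%2FDBID=CRYAA_00001"/>
    <link rel="self" type="application/atom+xml" href="http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA/0000001"/>
    <content type="text">
symbol:CRYAA
id:0000001
Variant/DNA:c.27G>T
Variant/DBID:CRYAA_00001
    </content>
  </entry>
</feed>"""


def parse_feed(xml_text):
    root = ET.fromstring(xml_text)
    entries = []
    for entry in root.findall(ATOM + "entry"):
        # Index the links by their rel attribute.
        links = {l.get("rel"): l.get("href") for l in entry.findall(ATOM + "link")}
        entries.append({
            "title": entry.findtext(ATOM + "title"),
            "lovd_url": links.get("alternate"),  # human-readable LOVD page
            "api_url": links.get("self"),        # this entry in the API
            "content": (entry.findtext(ATOM + "content") or "").strip(),
        })
    return entries


for item in parse_feed(FEED):
    print(item["title"], item["api_url"])
```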
Please note that I used "rest.php" in all the URLs here, although "rest" without the PHP extension also works in most cases. However, on some servers you may still need the .php suffix, so for clarity I use it here, too.
The webservice currently supports (please note that these links point to a development installation, not an actually maintained database):
Listing of all genes in the database:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes
Searching on the gene symbol (full match only):
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes?search_symbol=CRYAA
Showing only one specific gene entry:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes/CRYAA
Searching on the genomic position:
Chromosome only:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes?search_position=chr21
Chromosomal location:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes?search_position=chr21:44589236
Chromosomal range, exact match (only match genes having exactly this range):
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes?search_position=chr21:44589141_44592913&position_match=exact
Chromosomal range, exclusive match (only match genes completely within this range):
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes?search_position=chr21:44589141_44592913&position_match=exclusive
Chromosomal range, partial match (match any gene overlapping the given region):
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/genes?search_position=chr21:44589141_44592913&position_match=partial
Listing of all variant entries in a certain gene:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA
Searching on the DNA position:
Coding DNA or genomic position, exact match only:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA?search_position=c.27
This does not allow for partial matches, so mutation c.27_28del is not matched. c.34 will match c.34+? and c.34_35 will match c.34+?_35-?. However, c.34 does not match c.34+5. Searching on genomic locations can be achieved using g. as a prefix.
Genomic position only, exclusive match (only match variants completely within this range):
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA?search_position=g.44589000_44590000&position_match=exclusive
Genomic position only, partial match (match any variant overlapping the given region):
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA?search_position=g.44589000_44590000&position_match=partial
Searching on the DNA field:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA?search_Variant/DNA=c.27G>T
This does not allow for partial matches, but c.(27G>T) or c.27G>T? will also match.
Searching on the DBID field:
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA?search_Variant/DBID=CRYAA_00001
Showing only one specific variant entry (internal ID only):
http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA/0000001
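The query URLs listed above can also be assembled programmatically. The helper names below are hypothetical, and the base URL is the development installation used in the examples. The sketch percent-encodes parameter names and values (for instance search_Variant/DNA becomes search_Variant%2FDNA), on the assumption that the API decodes them as usual; LOVD's own link-back URLs use the same encoding for search_Variant%2FDBID.

```python
from urllib.parse import quote, urlencode

BASE = "http://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php"


def _with_query(url, params):
    if not params:
        return url
    # Keep ':' readable in positions such as chr21:44589236.
    return url + "?" + urlencode(params, safe=":")


def genes_url(symbol=None, base=BASE, **params):
    """URL for the gene listing, one gene entry, or a gene search."""
    url = base + "/genes"
    if symbol is not None:
        url += "/" + quote(symbol, safe="")
    return _with_query(url, params)


def variants_url(gene, variant_id=None, base=BASE, **params):
    """URL for the variants of a gene, one variant entry, or a variant search."""
    url = base + "/variants/" + quote(gene, safe="")
    if variant_id is not None:
        url += "/" + quote(variant_id, safe="")
    return _with_query(url, params)


print(genes_url(search_position="chr21:44589141_44592913", position_match="partial"))
print(variants_url("CRYAA", search_position="c.27"))
print(variants_url("CRYAA", **{"search_Variant/DNA": "c.27G>T"}))
```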
Starting at version 2.0-23, released December 7th, LOVD allows the generation of genomic locations of variants, provided a reference sequence has been properly configured in the database. We are using a new Mutalyzer tool for this, which uses information from the UCSC genome browser. A current problem with this information is that we can only map one version of an NM reference sequence to the genome; the UCSC data model does not allow more than one version of each reference sequence to be stored. We will change the data model of our local database to be able to store more versions of each NM transcript reference sequence, which should partially fix this problem.
Since the LOVD 2.0-24 API allows for searching for genes based on genomic position, it has become easier to utilize the LOVD APIs to quickly locate LOVD databases storing variants in a certain genomic region without the need to first find out which genes are located there. This way, the number of queries needed per LOVD installation has been reduced to usually one or two: 1) find gene databases on the given location, 2) if found, query that gene for any variants on the given location. In early February 2010, we created a service that can query all LOVD installations that have chosen to be published on the public list of LOVD installations on LOVD.nl (52 LSDBs with 1049 genes in total, of which 32 LSDBs with 822 genes run a usable LOVD version; figures as of 15 Feb 2010). We will test it using next-generation sequencing output to see how many of the variants found in the sequenced individual have been described somewhere in an LOVD on our list. This service will later be put online for the public to use.
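The two-step lookup can be sketched as below. This is not the code of the actual querying service: the function name is hypothetical, and the two fetch callables (which would retrieve a URL and parse the Atom feed into gene symbols or variant entries) are injected so that the flow can be shown without network access. Partial matching is assumed for both steps.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def query_installations(
    api_bases: Iterable[str],
    chrom: str,
    start: int,
    end: int,
    fetch_gene_symbols: Callable[[str], List[str]],
    fetch_variants: Callable[[str], List[str]],
) -> Dict[Tuple[str, str], List[str]]:
    """Return the variants found per (installation, gene) for a genomic region."""
    hits: Dict[Tuple[str, str], List[str]] = {}
    for base in api_bases:
        # Step 1: which gene databases of this installation overlap the region?
        genes_url = (
            f"{base}/genes?search_position={chrom}:{start}_{end}"
            "&position_match=partial"
        )
        for symbol in fetch_gene_symbols(genes_url):
            # Step 2: only for those genes, ask for variants in the region.
            variants_url = (
                f"{base}/variants/{symbol}?search_position=g.{start}_{end}"
                "&position_match=partial"
            )
            found = fetch_variants(variants_url)
            if found:
                hits[(base, symbol)] = found
    return hits
```

An installation with no gene in the region costs a single request, which is where the "usually one or two" queries per installation comes from.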