Developing Prototype LOVD web services at NGRL
| Contributed by: | Glen Dobson |
| Originally posted: | 28th September 2009: 4:32 pm |
| Last updated: | 25th November 2009: 12:50 pm |
| Short URL: | http://gen2phen.org/node/6647 |
At the National Genetics Reference Laboratory (NGRL), Manchester, we are investigating the development of prototype web services to allow machine-to-machine access to LOVD data. Here, I hope to provide an overview, and record some of the issues involved as we develop the simplest of these services.
Requirements
The simple service we aim to investigate first of all will provide a means of retrieving all public variants (and any associated public patient data) for a particular gene. This also implies a requirement to be able to get a list of genes supported by a particular LOVD database. For our Browser use case we are also interested in retrieving information about the reference sequence in use.
As this could be the basis of future, more fully featured services, ease of use and ease of installation are also important. Ease of use suggests a REST interface rather than, or as well as, a SOAP/WSDL interface. Ease of installation suggests that the service should be developed in PHP, which will already be present on the target system because LOVD itself is developed in PHP.
Implementation Notes
LOVD installation was relatively straightforward. The only problems encountered are well documented on the LOVD website. One of these was to do with strict settings in MySQL.
We investigated PHP frameworks hoping to find something useful for rapid development of REST and SOAP interfaces and abstraction between the database, PHP classes, XML, etc. We initially looked at WSO2, but quickly found that it was not the quick, lightweight solution we needed. It essentially needs to be rebuilt from source for different platforms, PHP versions, etc. This was not easy and did not meet our requirements. In the end we found the Zend framework to be useful, and have made use of the REST Server in particular.
We found it relatively easy to integrate with the existing PHP code and MySQL tables (centring around the many-to-many relationship between patients and variants) and have started to produce services that reproduce access to the publicly visible LOVD data.
The Web Service
We have so far concentrated on a REST interface to retrieve various information from an LOVD instance in a simple XML format. Because the aim of the exercise is prototyping and exploration, we have not supplied an XML schema or stuck to the XML interchange format that is being developed as part of Gen2Phen. A more mature LOVD web service should aim to do this, however.
The service is still at an early stage, but you can see the progress (on our own LOVD instance) using the URLs below:
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getAllGenes - returns all genes available at that particular LOVD instance
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getAllVariants&hgnc_symbol=UBE3A - returns all variants for a particular gene (i.e. as reported in every patient)
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getUniqueVariants&hgnc_symbol=UBE3A - as above, but returns unique variants (i.e. one variant element, with 0-n patient sub-elements)
http://ngrl.man.ac.uk/lovd2/ws/rest.php?method=getVariantById&hgnc_symbol=UBE3A&id=UBE3A_00001 - returns a single variant based upon the publicly visible database id.
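For client authors, the four calls above differ only in the method name and its parameters, so the URLs can be assembled programmatically. The following is a minimal Python sketch of that idea; the helper name `build_url` is hypothetical and not part of the prototype, and only the endpoint, method names and parameter names are taken from the URLs listed above.

```python
from urllib.parse import urlencode

# Endpoint of the NGRL prototype, copied from the URLs listed above.
BASE_URL = "http://ngrl.man.ac.uk/lovd2/ws/rest.php"


def build_url(method, **params):
    """Return the query URL for one of the prototype's methods.

    `build_url` is a hypothetical client-side helper; the service itself
    only sees the resulting query string.
    """
    query = {"method": method}
    query.update(params)  # keyword order is preserved in the query string
    return BASE_URL + "?" + urlencode(query)


all_genes_url = build_url("getAllGenes")
unique_variants_url = build_url("getUniqueVariants", hgnc_symbol="UBE3A")
single_variant_url = build_url(
    "getVariantById", hgnc_symbol="UBE3A", id="UBE3A_00001"
)
```

Fetching any of these URLs (for example with `urllib.request.urlopen`) should return the XML described in this post, provided the prototype is still online.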
The XML formats returned are more or less what was easiest to produce, and they attempt to reproduce the publicly visible information from LOVD. As you will see, we have not dropped any empty optional elements in the results, and are not returning LOVD URLs yet. Below is an example instance of a variant:
<variant>
  <id>UBE3A_00001</id>
  <exon> 08</exon>
  <dna_change>c.3_16del14</dna_change>
  <rna_change/>
  <protein_change>Frame shift (predicted)</protein_change>
  <restriction_site/>
  <frequency>-</frequency>
  <patient>
    <pathogenicity>
      <reported>Probably pathogenic</reported>
      <concluded>Probably pathogenic</concluded>
      <short>+?/+?</short>
    </pathogenicity>
    <id>003199(MC)</id>
    <disease>Angelman syndrome</disease>
    <reference/>
    <template>DNA</template>
    <technique>SEQ</technique>
    <remarks>Parents not tested - out of frame deletion so pathogenicity assumed.</remarks>
    <times_reported>1</times_reported>
    <variant_created>2008-12-08 16:01:02</variant_created>
    <variant_edited>2009-05-01 16:32:13</variant_edited>
    <patient_created>2008-12-08 16:01:02</patient_created>
    <patient_edited>2009-02-04 12:40:22</patient_edited>
  </patient>
</variant>
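Because empty optional elements are kept in the output, a client has to treat an empty element the same way as a missing one. The Python sketch below shows one way to read a variant of the shape above using only the standard library. The sample is an abbreviated copy of the example, and the helper names `text_of` and `parse_variant` are hypothetical.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the example variant shown above.
SAMPLE = """<variant>
  <id>UBE3A_00001</id>
  <exon> 08</exon>
  <dna_change>c.3_16del14</dna_change>
  <rna_change/>
  <patient>
    <pathogenicity>
      <reported>Probably pathogenic</reported>
    </pathogenicity>
    <id>003199(MC)</id>
    <disease>Angelman syndrome</disease>
  </patient>
</variant>"""


def text_of(element, path):
    """Return the stripped text at `path`, or None if missing or empty."""
    value = element.findtext(path)
    if value is None:
        return None
    return value.strip() or None


def parse_variant(xml_text):
    """Turn one <variant> element into a plain dictionary."""
    root = ET.fromstring(xml_text)
    return {
        # "id" matches only the direct child, not the patient's <id>.
        "id": text_of(root, "id"),
        "exon": text_of(root, "exon"),
        "dna_change": text_of(root, "dna_change"),
        "rna_change": text_of(root, "rna_change"),
        "patients": [
            {
                "id": text_of(patient, "id"),
                "disease": text_of(patient, "disease"),
                "reported": text_of(patient, "pathogenicity/reported"),
            }
            for patient in root.findall("patient")
        ],
    }


variant = parse_variant(SAMPLE)
```

The same function handles the output of `getUniqueVariants`, where a variant may carry zero or more `patient` sub-elements, since `findall` simply returns an empty list when there are none.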
The service is quite lightweight and only requires copying the PHP to your LOVD directory (SimpleXML module is required in PHP, but this is commonly enabled anyway).
Issues
- Many LSDBs do not specify reference sequences, making raw HGVS nomenclature difficult to interpret
- No simple mapping/binding framework from PHP classes to XML was found. The prototype could therefore be hard to maintain as the target XML schema becomes more complex.
Potential Further Work
- Demonstrate visualisation using NGRL browser (this does not require genomic coordinates)
- Rationalise services with other efforts
- Align with XML interchange format and provide feedback on changes to this format
- Get variant by region
- Get variant by feature (5', exon/intron, 3')
- Get variants updated/added since XXX
- Get suggested reference sequence given HGNC gene symbol, list of HGVS variants
- Machine accessible registry service for discovering LOVD web services
Requiring authentication/authorisation:
- Allow viewing of non-public data (e.g. for admin, curators)
- Service to allow submission of data (e.g. from other software)

Comments
#1 That's terrific! How
That's terrific!
How difficult is it to integrate with an existing LOVD installation and could you document the process so I can do it with mine?
Cheers
Tomasz
#2 Great to hear about these
Great to hear about these developments, Glen! Things certainly seem to be moving along. However, I am slightly puzzled. In your 'Requirements' section, you mention that you are using LOVD as your variant database (which I did not realise before), and given your use cases and example URLs you seem to have implemented basically the same features as Ivo on top of the same kind of data. Unless I am missing something, this seems like duplication of work?
That aside, I want to make the following comments on the implementation:
Generally, for both your and Ivo's WS's, I advise that you try to eliminate the .php from the URLs. It is of no concern to the user how the service is implemented (i.e. that you're implementing the service as a PHP script). Also, if you later implement the service as, say, a Java servlet or Perl script you're stuck with an ugly, misleading .php embedded in the URL which you cannot easily change because by then it will be in use 'in the wild'. Have a look at "Cool URIs don't change" at http://www.w3.org/Provider/Style/URI.
Also, I want to propose the following minor variations on your URLs, to streamline them and make them more RESTful and aligned with the LOVD ones:
http://ngrl.man.ac.uk/lovd2/ws/rest/genes - returns all genes available at that particular LOVD instance
http://ngrl.man.ac.uk/lovd2/ws/rest/allvariants/UBE3A - returns all variants for a particular gene (i.e. as reported in every patient)
http://ngrl.man.ac.uk/lovd2/ws/rest/uniquevariants/UBE3A - as above, but returns unique variants (i.e. one variant element, with 0-n patient sub-elements)
http://ngrl.man.ac.uk/lovd2/ws/rest/variant/UBE3A_00001 - return a single variant
Note: In the last example, surely the gene symbol (present in your original example) is not required since the variant ID is unique within the database?
As I described in my previous post to the group, a major part of the REST style is to model things as resources, and for each resource create a URL which is a noun. Clients then invoke standard HTTP methods (GET, POST etc.) on these resources. This contrasts with the approach of using a single URL endpoint via which clients invoke one of several non-standardized methods which are entirely specific to your particular application, which is basically the way your single URL endpoint and method=getUniqueVariants/etc. setup works as it stands. This may sound like a subtle distinction, but doing the former is a surprisingly powerful way to keep the API simple and easy to understand.
Furthermore, I suggest you follow Ivo's lead and try out Atom XML as a 'wrapper' around elements (see details on this in my recent post here: http://www.gen2phen.org/node/7142). This immediately enables websites such as the Knowledge Centre to pull in your feeds and re-publish to a wider audience.
Lastly, on your point about registry service: such a service already exists: the recently-launched http://www.biocatalogue.org registry for bioinfo web services is the product of work by your close neighbours in Manchester, the EBI and others working on myGrid/OMII-UK/Taverna. BioCatalogue is currently centred on WS's described with WSDL since most existing bioinfo-WS's are SOAP-based. But I believe BC developers are working on support for RESTful WS's described with the much simpler WADL language.
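To illustrate the resource-style URLs proposed in this comment, the Python sketch below maps each noun-style path onto the method and parameters that the existing prototype already understands. It is only an illustration: the function name `route` is hypothetical, and deriving the gene symbol from the variant ID assumes the public ID always has the form `<gene symbol>_<number>`, which the post does not guarantee.

```python
def route(path):
    """Map a resource path to (method, params) for the existing prototype.

    Hypothetical helper: it only translates URLs and performs no request.
    """
    parts = [part for part in path.strip("/").split("/") if part]
    if parts == ["genes"]:
        return "getAllGenes", {}
    if len(parts) == 2:
        resource, key = parts
        if resource == "allvariants":
            return "getAllVariants", {"hgnc_symbol": key}
        if resource == "uniquevariants":
            return "getUniqueVariants", {"hgnc_symbol": key}
        if resource == "variant":
            # Assumption: the public ID looks like <gene symbol>_<number>,
            # so the symbol is everything before the last underscore.
            symbol = key.rsplit("_", 1)[0]
            return "getVariantById", {"hgnc_symbol": symbol, "id": key}
    raise LookupError("no such resource: " + path)


method, params = route("/variant/UBE3A_00001")
```

A translation layer of this kind would let the noun-style URLs be offered alongside the existing `method=` URLs without changing the code behind them.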
#3 Tomasz, as you will have seen
Tomasz, as you will have seen Ivo has also worked on something very similar. It would be worth looking at the latest version of LOVD and using that perhaps? However, if you are still interested in the service that I have developed then send me a direct mail and I can share it with you.
#4 Mumi, you are correct. Ivo
Mumi, you are correct. Ivo and I have both ended up producing very similar things. This has involved some duplication of work, although neither prototype has taken very long to produce. Essentially Ivo surprised me by producing his prototype in the space of a day!!! The advantage is that this allows some fairly straightforward discussions to take place by direct comparison.
For instance, the service I have produced returns (nearly?) all publicly visible data, but Ivo has rightly pointed out that to launch this level of access in LOVD itself would take a lot more work to allow curators to configure which fields they actually want to publish via the web service channel.
My minor gripe with Ivo's service is probably the inclusion of the main content in a single text tag rather than using XML to structure it. This is perfectly understandable in a first prototype however.
Re. "our" database we have the DMuDB (our own proprietary database containing variants from UK diagnostic labs), but we host an LOVD with a few genes in it, mainly for UK diagnostic users who request a public variant database.
The reason I was interested in prototyping LOVD web services was largely unconnected to our databases however, as I am mainly interested in visualisation of data in public LOVD instances. All I want for this is a standard interface providing sufficient data.
I completely agree about everything you say re. URLs. What I have presented is basically what was fastest to produce. The PHP framework I used enforced these URLs, but I am not sure that the framework was providing huge benefits in the end anyway (other than this URL mapping). If I did develop this web service further then I would probably drop the framework and use more RESTful URLs! However, since Ivo seems to be making progress I am more likely to try and make use of his service unless there is strong reason not to.
Yes, have seen biocatalogue btw - though haven't looked in detail. Presumably there is a WS interface to do lookups? I still wonder whether a simple LSDB specific registry might not be useful for federation and discovery purposes. It is probably a moot point at this early stage anyway.
#5 Hi guys, Mummi: Although I
Hi guys,
Mummi: Although I agree with you that it's best, it will not be possible to _always_ remove the .php from the URL. "Not possible" in this case means: I want people to be able to install the LOVD update without needing a server administrator to change webserver configuration settings. I could provide a settings file with LOVD that does just what you want, but only on Apache. On non-Apache webservers that file is ignored and the user has to use .php in order not to get an error.
Actually, the examples that I gave in my post also work without the .php, because of the way that server is configured.
About the ID having the gene symbol in there:
http://ngrl.man.ac.uk/lovd2/ws/rest/variant/UBE3A_00001
The gene symbol is required, since every gene has its own set of variants and variant IDs. "Variant 00001" does not indicate in which gene LOVD needs to find that variant. This will change in LOVD 3.0.
Other than that, IDs in the form of "UBE3A_00001" are not necessarily unique in the database, because these aren't the internal primary key values. There may be several variant entries in the UBE3A variant table with that value in the DBID field (they must all share the same DNA field value, though), all with a different variantid value (the actual primary key).
Glen: I agree with the usefulness of an LSDB registry. Of course there is the HGVS list, but it's static and updates take a long time. I once developed a dynamic system based on LOVD that would be easy to build a WS upon, but that data is now outdated:
http://grenada.lumc.nl/www.hgvs.org/dblist/new/lsdb.php
Besides that, the list of known LOVD installations worldwide on the LOVD website would also be quite easy to extend with a WS, but then we'd only have LOVD installations.
#6 Ivo, regarding the registry:
Ivo, regarding the registry: what you are describing sounds rather close to the Database Description Framework (DDF) which "allows resources to describe key technical metadata in a formalised way". The relevance to an LSDB registry is that DDF is primarily intended to help find database resources and at a glance see their technical capabilities etc. DDF underpins the Mouse Resource Browser: http://www.fleming.gr/mrb
We are working with CASIMIR to adapt DDF for human G2P resources (Adam Webb is the contact). Here's a new Drupal DDF site created by them: http://casimir1.pdn.cam.ac.uk/casimir_ddf/
I don't see why you couldn't apply those principles to LOVDs and add each instance as a DDF entry. You could perhaps populate a DDF registry automatically.
#7 Regarding BioCatalogue: it
Regarding BioCatalogue: it doesn't make sense to me to reinvent the wheel and create an LSDB-specific registry for machines to discover and invoke WS's, not if there is already a Life Sciences-wide registry available for this purpose.
In any case, you could probably achieve the same thing by adding LOVD WS's to BC, then querying their API and pull out a subset of services tagged as LOVD or LSDB or similar, and publish this on, say, the Knowledge Centre (gimme feeds, feeds, feeds!). Worth investigating at some stage, but premature now, I agree with Glen.
#8 Ivo, regarding the URLs. A) I
Ivo, regarding the URLs. A) I understand about the .php extension, so I guess I'll fall back to suggesting that you add something similar to Drupal's 'enable clean URLs', so that IF the webserver used supports it and is configured properly, THEN the 'clean' URLs are produced everywhere (in web pages, feeds etc.)
B) From an API user perspective, it doesn't sound too bad to use something like /variant/[gene symbol]/[variant ID] (or perhaps gene/[symbol]/[variant ID] if the database is partitioned by gene). Either way, it's the general principle I'm pointing out here, to link directly back to the 'thing' you're describing, in a way that a computer (and not just a human) can actually navigate the links and retrieve the data.
#9 Great to see progress on the
Great to see progress on the WS side for LOVD!
From the diagnostics user perspective, the following requests would be useful:
mutations/variants on a given gene at c.### or c.###+-## (yes, I understand we have a reference sequence/transcript issue)
mutations/variants on a given gene at all 3 nucleotides of a codon given by c.###
Cheers,
David.
#10 Thanks Mummi for taking this
Thanks Mummi for taking this up, clean URLs are important. Here is the first page I got from google search. Hope it is useful :-) http://evolt.org/Making_clean_URLs_with_Apache_and_PHP
#11 I'm wondering - wouldn't it
I'm wondering - wouldn't it just be much easier to manually register 1 LOVD query tool, with one location, instead of trying to automatically register the 51 databases on the list of public LOVD installations, not even mentioning the necessary automated update or removal if someone changes the configuration in LOVD?
About the clean URLs: I still don't get the point. What's the use in trying to use clean URLs when you _always_ need to be able to fall back on the non-clean URLs when you encounter an LOVD instance not running on Apache? If you need to write code to query an LOVD, would you seriously go and try two separate URLs, if it could have been just one URL that happens to include a .php extension? I agree that virtually all LOVDs are running on Apache (and thus automatically support the clean URL method), but I can never force users to use Apache.
#12 Hi David, The LOVD webservice
Hi David,
The LOVD webservice already supports searching on location (see my post http://www.gen2phen.org/post/development-lovd-restful-atom-webservice), but it needs some tweaking to allow searching for ranges as well.
c.### and c.###+-## formats are supported. Codon numbers aren't, but that would only require a simple calculation to convert to c.### format.
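The calculation mentioned here is short enough to sketch in Python. It assumes plain positive coding positions, where c.1 is the first base of the start codon, and does not handle intronic offsets (c.###+-##) or UTR positions. The function names are hypothetical and are not part of the LOVD webservice.

```python
def codon_to_c_positions(codon):
    """Return the three c. positions covered by a 1-based codon number."""
    if codon < 1:
        raise ValueError("codon numbers start at 1")
    first = 3 * codon - 2  # codon 1 covers c.1, c.2 and c.3
    return ["c.%d" % (first + offset) for offset in range(3)]


def c_position_to_codon(position):
    """Return the 1-based codon number containing coding position c.<position>."""
    if position < 1:
        raise ValueError("only positive coding positions are supported")
    return (position - 1) // 3 + 1


# All three nucleotides of the codon that contains c.14:
same_codon = codon_to_c_positions(c_position_to_codon(14))
```

Combining the two functions gives the second request from comment #9: starting from a single c.### position, a client can derive all three positions of that codon and query each one in turn.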
#13 Ivo, about the URLs yet
Ivo, about the URLs yet again: you could use the same argument to argue for never using anything other than Word for collaborative writing (no wikis, no Google Docs etc.), because that is the fallback and 'lowest common denominator' that everyone has installed and knows how to use. Or supporting Internet Explorer 6 forever.
Also, if I may use your own argument against you: two or three years from now, do you *really* want all LOVD API clients out there to be forced to change their code because you have changed your implementation from PHP to Python, and now all your URLs must have the .py suffix? That seems to me like a far worse problem.
One compromise I can propose: just accept that a handful of LOVD installations can't or won't be able to publish the clean URLs, and try (via docs and/or warnings in the admin UI) to make administrators of these LOVDs aware of the disadvantages in terms of API accessibility and uniformity (compared to the 'virtually all' other LOVDs which will respond to & publish clean URLs).
#14 Hi Mummi, Well, I guess we
Hi Mummi,
Well, I guess we should maybe agree that we don't agree :)
I don't think this is the correct place to continue this discussion, so if you'd like me to respond to your examples I think we should do that over email or a phone discussion, after which we could post the outcome, if any, to the KC. How about that?