Web Service Interfaces to LSDBs - Browser Point of View

Contributed by: Glen Dobson
Originally posted: 29th May 2009, 1:26 pm
Last updated: 25th November 2009, 10:00 am
Short URL: http://gen2phen.org/node/1939
Table of Contents
    • Introduction
    • Problems
    • Required Operations
    • Implementation Issues

This post discusses web service interfaces to Locus Specific Databases (LSDBs), from the specific point of view of visualising the data. The aim is to make our experiences of this at NGRL clear and to suggest, on this basis, why web services are desirable and in the broadest sense what they should do to satisfy the "variant browser" use case.

Introduction

For visualisation of LSDB data, one of the major advantages of web service interfaces to LSDBs would be dynamic retrieval of variant data. This would ensure that the data could always remain up-to-date in the browser. In the NGRL Universal Browser, for instance, local copies are made of all external variant databases. This is simply because it is very rare to find a machine-to-machine interface for accessing such data. The only database that is accessed live is NGRL's own Diagnostic Mutation Database (DMuDB). Live access provides significant advantages in maintainability, particularly with regard to ensuring data is kept up-to-date and avoiding complex, long-winded import procedures.

Standard web service interfaces would also allow users to more easily visualise LSDB data without requiring that the browser software have any prior knowledge of the LSDB in question. With appropriate service discovery mechanisms and standardised interfaces any compliant LSDB could essentially be plugged in to any compliant visualisation tool. The standardised interfaces and exchange formats that would come with web services would also help avoid the potentially arduous task that currently presents itself to the developers of visualisation tools when seeking to integrate new LSDBs.

Problems

The problems include the following:

  • Heterogeneous Data Schema: Different LSDBs will have different fields, name them differently, etc. A separate parser is required for each different schema.
  • Heterogeneous Data Formats: Even if the data is the same the file formats can differ (e.g. HTML, XML, TSV, CSV, etc.). Separate code to read the file is needed for each different file type.
  • Heterogeneous Transport Protocols: How one gets hold of the data varies between LSDBs e.g. HTTP (GET versus POST), FTP, SOAP. Separate code is potentially needed to retrieve data using each different transport protocol.
  • Lack of Machine Interpretability: The amount of structure present in LSDBs varies. Generally, the greater the reliance on natural language, the harder it is to process data automatically. Different terminology for the same concept is also common. Each different terminology requires more code.
  • Heterogeneity in Nomenclature: HGVS nomenclature is by no means universally used. Where it is used it is often incorrect. For each different nomenclature and each different misuse of the nomenclature special cases have to be coded.
  • Heterogeneity in Reference Sequences: In the worst case LSDBs do not state a reference sequence at all. Where they do, it is often necessary to resolve data from different sources to one reference sequence so that they can be viewed side-by-side. Different code is needed for each different source and format of reference sequence, and the task is hard to automate due to sequence and transcript variation.
  • Unpredictable Data Updates: Often there is no regular update schedule for an LSDB. Moreover there is sometimes no easy way of checking whether an update has occurred. It is therefore difficult to automate updates. At worst the update will be not only to the data, but to the schema, etc. New instances of all of the problems mentioned above may also come into play when reflecting LSDB updates. The process can therefore be time consuming and require more new code to be written.
  • The Inability to Query the LSDB: Although making a local copy of all relevant LSDB data can be a viable option (allowing one to create one's own local query engine) in some cases it might be more suitable to query the data on the fly. For instance, in a browser it might be most suitable to get data for the visible genomic region on the fly. Similarly, where local copies of the data are cached it would be nice to query by date in order to retrieve only the data that has not yet been cached. There is no standard way to perform such queries with current LSDBs and it is often impossible.
  • Poor Quality Data: LSDB data is often wrong. Errors in numbering will often be immediately obvious in a browser, whereas other errors will not. Web service interfaces would do little to help with this, but standardised interchange formats, LRGs and supporting tools may at least help reduce the numbering problems. This would mean that more data could be visualised and potentially that less checking would be necessary at the browser end.
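
To make the cost of the first two problems concrete, the following is a minimal sketch of the kind of per-source adapter a browser currently has to maintain. The two sources (lsdb_a, lsdb_b), their column names, the common field names and the reference sequence placeholder are all invented for illustration; no real LSDB export is being described.

```python
import csv
import io

# Hypothetical per-source settings: the delimiter used by the export, plus a
# map from the source's column names to the field names used by the browser.
SOURCES = {
    "lsdb_a": {"delimiter": "\t", "fields": {"Mutation": "hgvs", "RefSeq": "reference"}},
    "lsdb_b": {"delimiter": ",", "fields": {"dna_change": "hgvs", "ref_seq": "reference"}},
}


def read_variants(source, text):
    """Parse one export and rename its columns to the common field names."""
    config = SOURCES[source]
    reader = csv.DictReader(io.StringIO(text), delimiter=config["delimiter"])
    return [
        {common: row[original] for original, common in config["fields"].items()}
        for row in reader
    ]


# The same (made-up) variant as exported by two differently structured sources.
tsv_export = "Mutation\tRefSeq\nc.76A>T\tREF_PLACEHOLDER.1\n"
csv_export = "dna_change,ref_seq\nc.76A>T,REF_PLACEHOLDER.1\n"

from_a = read_variants("lsdb_a", tsv_export)
from_b = read_variants("lsdb_b", csv_export)
```

Every new LSDB adds another entry to a table like SOURCES, and often another parser besides; a shared interchange format would reduce this to a single code path.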

Nearly all of these problems are being addressed across GEN2PHEN in general. Many of them are addressed by the LSDB-in-a-box approach, standardisation and LRGs. However, these do not entirely solve the problem of automatically integrating this data into a browser or automatically keeping it up-to-date.


Required Operations

From the point of view of our relatively simple scenario the operations we would require are along the lines of:

getSummary - IN none - OUT an LSDBSummary
getAllVariants - IN none - OUT a VariantList
getVariants - IN a Query - OUT a VariantList
getVariantByID - IN an ID string - OUT a Variant
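
As a rough illustration only, the four operations could be expressed as an abstract interface along the following lines. The Python method names, the toy in-memory implementation and the treatment of a Query as a plain predicate function are assumptions made for this sketch; none of them is part of any agreed specification.

```python
from abc import ABC, abstractmethod


class LSDBService(ABC):
    """The four operations listed above, with types left as plain Python values."""

    @abstractmethod
    def get_summary(self):
        """Return summary-level information about the LSDB."""

    @abstractmethod
    def get_all_variants(self):
        """Return every variant held by the LSDB."""

    @abstractmethod
    def get_variants(self, query):
        """Return the variants selected by the query."""

    @abstractmethod
    def get_variant_by_id(self, variant_id):
        """Return the single variant with the given ID."""


class InMemoryLSDB(LSDBService):
    """Toy implementation backed by a dict, standing in for a real service."""

    def __init__(self, name, variants):
        self._name = name
        self._variants = dict(variants)  # variant ID -> variant record

    def get_summary(self):
        return {"name": self._name, "total_entries": len(self._variants)}

    def get_all_variants(self):
        return list(self._variants.values())

    def get_variants(self, query):
        # Assumption: the query is a function that accepts or rejects a variant.
        return [v for v in self._variants.values() if query(v)]

    def get_variant_by_id(self, variant_id):
        return self._variants[variant_id]


service = InMemoryLSDB("demo", {
    "v1": {"id": "v1", "start": 10},
    "v2": {"id": "v2", "start": 200},
})
nearby = service.get_variants(lambda v: v["start"] < 100)
```

A browser written against LSDBService would not need to know which LSDB sits behind it, which is the "plug in" property described in the introduction.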

For these operations, it is the output types in particular which would depend upon an interchange format. A good deal of work has already been done on the XML interchange format since this text was originally written, so only brief notes regarding datatypes are presented here, to clarify the scenario we envision. The real job of building web service interfaces would involve taking the types defined in the interchange format as the starting point.

Query - would allow bounds to be specified on region (reference sequence) and on time of update, and perhaps also bounds on region specified in HGVS numbering (this is an open question). This need not be dependent on the interchange format.
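
A minimal sketch of such a Query follows, assuming integer positions on a named reference sequence and treating every bound as optional. The field names and the matches helper are hypothetical, and HGVS-numbered bounds are left out since that point is still open.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Query:
    reference: str                            # reference sequence identifier
    start: Optional[int] = None               # inclusive lower bound on position
    end: Optional[int] = None                 # inclusive upper bound on position
    updated_since: Optional[datetime] = None  # only entries changed after this time

    def matches(self, reference, position, last_updated):
        """Return True if a variant with these properties satisfies the query."""
        if reference != self.reference:
            return False
        if self.start is not None and position < self.start:
            return False
        if self.end is not None and position > self.end:
            return False
        if self.updated_since is not None and last_updated <= self.updated_since:
            return False
        return True


# Everything between positions 100 and 200 that changed after 29th May 2009.
visible_region = Query("REF1", start=100, end=200, updated_since=datetime(2009, 5, 29))
```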


LSDBSummary - would present summary-level information about the LSDB, e.g. name, version, url, link-out-url, creation date, last updated, total entries, number of unique variants, Contact, Gene/s (the latter two would probably be ComplexTypes). A Gene would, for instance, include the HGNC symbol and ID, the Entrez Gene ID, the MIM number and the reference sequence used. Personally I think we shouldn't limit the reference sequence to being an LRG, although LRGs make the job much easier. I think we have to be able to work with legacy data. The LSDBSummary would probably be (partly?) dependent on elements of the interchange format.


VariantList = a sequence of Variant objects


Variant = still to be defined. It is this that I would really wait for the exchange format on, but it could closely follow the existing PAGE model, the minimal core information suggested by Johan, etc. We don't bring strong requirements in this regard, as we are used to working with existing LSDBs. Some reference sequence information and an HGVS name would suffice for us. We are mainly interested in genotype information, and potentially some very "shallow" phenotype information. That is not to say that the format/services should not be designed with a much more generic usage in mind.
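
Purely to fix ideas, the datatypes sketched in these notes might look something like the following. The field names are a guess based on the lists above, only a subset of the summary fields is shown, and the real types would come from the interchange format. The summarise helper and all identifier values are hypothetical placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Gene:
    hgnc_symbol: str
    hgnc_id: str
    entrez_gene_id: str
    mim_number: str
    reference_sequence: str  # an LRG or any legacy reference sequence


@dataclass
class Variant:
    variant_id: str
    reference_sequence: str
    hgvs_name: str
    phenotype: Optional[str] = None  # "shallow" phenotype note, if any


@dataclass
class LSDBSummary:
    name: str
    version: str
    url: str
    last_updated: str
    total_entries: int
    unique_variants: int
    genes: List[Gene] = field(default_factory=list)


def summarise(name, version, url, last_updated, genes, variants):
    """Build a summary whose counts are derived from a VariantList."""
    unique = {(v.reference_sequence, v.hgvs_name) for v in variants}
    return LSDBSummary(name, version, url, last_updated,
                       len(variants), len(unique), list(genes))


# A VariantList is simply a sequence of Variant objects.
variant_list = [
    Variant("1", "REF1", "c.76A>T"),
    Variant("2", "REF1", "c.76A>T"),  # same change reported in a second entry
    Variant("3", "REF1", "c.100del"),
]
summary = summarise("Demo LSDB", "1.0", "http://example.org/lsdb", "2009-11-25",
                    [Gene("GENE1", "HGNC:0", "0", "0", "REF1")], variant_list)
```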

Implementation Issues

For performance and dependability reasons we do not want our browser to be completely dependent on accessing data via external web services. Nor do LSDBs want to be burdened with lots of redundant calls to their services. We would therefore aim to cache LSDB data in a "lazy" fashion. Initially we would get all data from the LSDB and make a local copy. Each time a user browsed to a particular region in a gene we would then make a query for any variants in this region that had been added or updated since the last date of our cached version. If there were any changes we would update our cache. Either way we would update the timestamp on the cache.
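
The lazy strategy can be sketched as follows. The fetch_all and fetch_changes callables stand in for calls to getAllVariants and getVariants and are hypothetical, as is the dict-based variant record. One departure from the description above should be noted: the sketch keeps a timestamp per queried region rather than one for the whole cache, on the assumption that a single cache-wide timestamp would cause changes in regions not covered by the latest query to be skipped.

```python
class LazyVariantCache:
    """Local copy of an LSDB, refreshed region by region as users browse."""

    def __init__(self, fetch_all, fetch_changes, clock):
        # fetch_all() -> list of variant dicts (stands in for getAllVariants)
        # fetch_changes(start, end, since) -> variants in the region changed
        #   after `since` (stands in for getVariants with a Query)
        # clock() -> current time, injected so the sketch is easy to test
        self._fetch_changes = fetch_changes
        self._clock = clock
        self._loaded_at = clock()
        self._variants = {v["id"]: v for v in fetch_all()}
        self._checked_at = {}  # (start, end) -> time of the last check

    def variants_in_region(self, start, end):
        # Regions never visited before are only as fresh as the initial load.
        since = self._checked_at.get((start, end), self._loaded_at)
        now = self._clock()  # taken before the call so no change slips through
        for variant in self._fetch_changes(start, end, since):
            self._variants[variant["id"]] = variant
        # The timestamp is updated whether or not anything changed.
        self._checked_at[(start, end)] = now
        hits = [v for v in self._variants.values() if start <= v["position"] <= end]
        return sorted(hits, key=lambda v: v["position"])
```

Only the first call transfers the whole database; later calls transfer just the entries that changed in the region being viewed.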

One would probably also add further heuristics to avoid making too many calls, such as only checking for updates if the cache is more than X hours old. Such differences in strategy do not, however, affect what is required of the web service interfaces.
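
A minimal sketch of the "more than X hours old" heuristic is given below. The class name, the use of a gene symbol as the key and the six-hour value are arbitrary choices made for illustration.

```python
from datetime import datetime, timedelta


class UpdateThrottle:
    """Allows at most one update check per key within a given time window."""

    def __init__(self, max_age):
        self._max_age = max_age
        self._last_checked = {}  # key (e.g. a gene symbol) -> time of last check

    def should_check(self, key, now):
        last = self._last_checked.get(key)
        if last is not None and now - last < self._max_age:
            return False  # the cached copy is still considered fresh
        self._last_checked[key] = now
        return True


throttle = UpdateThrottle(max_age=timedelta(hours=6))
start = datetime(2009, 5, 29, 13, 0)
first = throttle.should_check("GENE1", start)                       # no check yet
second = throttle.should_check("GENE1", start + timedelta(hours=1)) # too soon
```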

An alternative implementation strategy, which might also be favourable for federation of LSDBs, would follow a publish-subscribe model. In this model clients (including browsers) would subscribe to the LSDB service and the LSDB would then "push" the data to them, either whenever there were changes or according to some schedule. This would have the disadvantage of requiring clients to implement call-back interfaces, which is not always feasible. It could potentially cause problems for LSDBs as well, as they might end up with large subscriber lists and not know which subscribers were still valid (although they could detect whether the call-back interface existed). It also requires some strategy to ensure that clients do not accidentally miss updates. All things considered this seems a more complex option.
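
The in-process sketch below illustrates two of the complications just mentioned. Numbering each update is one possible way (an assumption of this sketch, not a proposal from the text) to let a client notice that it has missed something, and dropping a call-back that raises an error is one simple way to prune an invalid subscriber. All class and method names are hypothetical, and a real service would make these calls over the network.

```python
class Publisher:
    """LSDB side: numbers each update and pushes it to subscriber call-backs."""

    def __init__(self):
        self._log = []          # every update ever published, in order
        self._subscribers = []

    def subscribe(self, callback):
        self._subscribers.append(callback)

    def publish(self, variant):
        self._log.append(variant)
        sequence = len(self._log)  # 1-based sequence number of this update
        for callback in list(self._subscribers):
            try:
                callback(sequence, variant)
            except Exception:
                # A call-back that fails is treated as no longer valid.
                self._subscribers.remove(callback)

    def replay(self, after):
        """Return (sequence, variant) pairs published after the given number."""
        return [(i + 1, v) for i, v in enumerate(self._log) if i + 1 > after]


class Subscriber:
    """Client side: applies updates in order and fills any gap it detects."""

    def __init__(self, publisher):
        self._publisher = publisher
        self.last_sequence = 0
        self.variants = []

    def on_update(self, sequence, variant):
        if sequence > self.last_sequence + 1:
            # Gap detected: fetch what was missed before applying this update.
            for missed_sequence, missed in self._publisher.replay(self.last_sequence):
                if missed_sequence < sequence:
                    self.variants.append(missed)
        if sequence > self.last_sequence:
            self.variants.append(variant)
            self.last_sequence = sequence


lsdb = Publisher()
lsdb.publish("variant-a")           # published before anyone subscribed
browser = Subscriber(lsdb)
lsdb.subscribe(browser.on_update)
lsdb.publish("variant-b")           # the browser sees number 2 and fetches number 1
```

Even in this toy form the publisher has to keep a log and the client has to track sequence numbers, which gives some idea of why this option looks more complex than lazy polling.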

We also have to address the question of whether something like DAS would be a better way to address the issue.

 


Comments


#1

Submitted by Gudmundur A Thorisson on Fri, 10/07/2009 - 16:55.

Regarding DAS, I reckon you would want to support this for genomic-interval queries returning sets of sequence features, at least as one of potentially several formats. Something like this:  

  • DAS XML => display variant features in the region in Ensembl or other browsers 
  • Regular XHTML => tabular summary of variants in the region 
  • Atom XML feed => for aggregating via feed reader or other databases 

PS I'll have more to say on the Atom feed concept in a separate post I am working on.


#2

Submitted by antbro on Fri, 17/07/2009 - 00:51.

Nice post Glen!

In case you did not know:
- Christophe Beroud (GEN2PHEN Partner 'INSERM'), who runs the UMD LSDB, and David Atlan (GEN2PHEN Partner 'PHENO') from PhenoSystems have together put a custom web service on top of UMD, coupled with a function in PhenoSystems' diagnostics software that accesses it.
- Ivo Fokkema and Johan den Dunnen (GEN2PHEN Partner 'LUMC'), who run the LOVD LSDB, are designing a bespoke web service to sit atop their system, but must first be able to provide variant positions in genome coordinates rather than relative to whatever reference sequence each gene database is based upon.

I am not sure how much you and these groups are comparing notes and synergising your activities in these areas?


#3

Submitted by Gudmundur A Thorisson on Sun, 19/07/2009 - 12:07.

Would it be possible for UMD/PhenoSystems to share with us information on the new web service implementation? It would be useful to have this available for discussion in the group. Same for the planned LOVD service.


#4

Submitted by Ivo F.A.C. Fokkema on Wed, 22/07/2009 - 13:30.

Well, we have nothing yet; neither on paper nor in code. But thanks to this article and the REST vs. SOAP article from Mummi, I know better where to start.
I would be very interested in taking a look at that UMD implementation, too.


G2P Knowledge Centre is part of GEN2PHEN and funded by the Health Thematic Area of the Cooperation Programme of the European Commission within the VII Framework Programme for Research and Technological Development.
