Exchange format for LSDBs

Hi All

Ivo and I have been discussing the data format and have arrived at the following proposal:

- Use XML elements rather than attributes, for extensibility etc.
- Add separate elements for variation aliases and sequence changes. The latter is for related, often consequential, sequence and structural changes that are often carried along with the main variation entry. Aliases are like db_xrefs but can also carry information on the reference sequence and other details.

<variant>
  <name>c.34G>C</name>
  <naming_scheme>HGVS</naming_scheme>
  <ref_seq>XY000000</ref_seq>

  ... then detection templates and other usual stuff in the same way

  <aliases>

    <variant>
      <name>c.342G>C</name>
      <naming_scheme>HGVS</naming_scheme>
      <ref_seq>LRG000001</ref_seq>
    </variant>

    <variant>
      <name>MUTXYZ</name>
      <naming_scheme>FINDIS</naming_scheme>
      <ref_seq>NM000001</ref_seq>
    </variant>

  </aliases>

  <seq_change>

    <variant>
      <name>g.232323G>T</name>  <!-- now the change is in genomic DNA -->
      <ref_seq>AC0001</ref_seq>
      <naming_scheme>HGVS</naming_scheme>
    </variant>

    <variant>
      <name>A447RfsX11</name>
      <naming_scheme>FINDIS</naming_scheme>
    </variant>

  </seq_change>

</variant>
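
As a rough illustration of how a client might read this structure, here is a minimal Python sketch using the standard library. It assumes a cleaned-up copy of the example above (free-text line dropped, ">" escaped for safety); the helper name `describe` is made up for this illustration and is not part of the proposal.

```python
import xml.etree.ElementTree as ET

# Cleaned-up copy of the proposal; accessions are the placeholders from the post.
DOC = """
<variant>
  <name>c.34G&gt;C</name>
  <naming_scheme>HGVS</naming_scheme>
  <ref_seq>XY000000</ref_seq>
  <aliases>
    <variant>
      <name>c.342G&gt;C</name>
      <naming_scheme>HGVS</naming_scheme>
      <ref_seq>LRG000001</ref_seq>
    </variant>
    <variant>
      <name>MUTXYZ</name>
      <naming_scheme>FINDIS</naming_scheme>
      <ref_seq>NM000001</ref_seq>
    </variant>
  </aliases>
  <seq_change>
    <variant>
      <name>g.232323G&gt;T</name>
      <ref_seq>AC0001</ref_seq>
      <naming_scheme>HGVS</naming_scheme>
    </variant>
    <variant>
      <name>A447RfsX11</name>
      <naming_scheme>FINDIS</naming_scheme>
    </variant>
  </seq_change>
</variant>
"""


def describe(variant):
    """Return (name, naming_scheme, ref_seq) for one <variant> element.

    Only direct children are read, so nested variants are not mixed in.
    Missing children (e.g. no ref_seq) come back as None.
    """
    return tuple(variant.findtext(tag) for tag in ("name", "naming_scheme", "ref_seq"))


root = ET.fromstring(DOC)
main = describe(root)
aliases = [describe(v) for v in root.findall("aliases/variant")]
seq_changes = [describe(v) for v in root.findall("seq_change/variant")]
```

Because the nested entries reuse the same `variant` element, one helper can read the main entry, the aliases and the sequence changes alike.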

============================

Comments:

The main variant element is the reference entry people are working with. In principle the nested variant elements can carry the same level of detail, but this should be optional.

It is not always possible to say whether related variation info should go into the aliases or the seq_change section, for example for variations on different splice variant templates (cDNA templates). But perhaps this does not matter.

Perhaps we should also add a tag that tells whether the seq_change is experimentally verified or not.

Implementation-specific things like global ids should go into attributes, if those are needed. (?)

<variant id="lsid://findis.org/variant/00001" />


Juha


Comments

>Use xml-elements and not attributes, due to extendability etc.

Juha, can you elaborate on this point, please? I don't quite see how using only elements helps with extensibility, but it would certainly make the markup much more verbose! You could also use the same arguments to criticize the current LRG XML spec, which is a mix of elements and attributes (see e.g. sample files here: ftp://ftp.ebi.ac.uk/pub/databases/lrgex/).

Name, which is usually an obvious choice for an attribute, can become rather complex, and thus an element would be a better option. And then why should we use attributes for the others if name isn't an attribute? Also, elements leave room to add more structure later if needed, but perhaps this is not a valid argument since it would likely break loaders.

I am not an XML expert and do not have a strong opinion on this. Happy to use attributes as well...

I find elements much easier to parse and extend. The decision is arbitrary though:
http://www.ibm.com/developerworks/xml/library/x-eleatt.html

Just to give a counterexample to LRG, BioPortal web services are element-only :)

Tomasz, thanks for the useful link!

Juha - on a general, perhaps intuitive note, data elements of a simple type which have a one-to-one relationship with the parent element usually work fine as attributes. For instance, it would be fairly conventional to do something like this:

<variant id="foobar" datecreated="2009-11-01" datelastmodified="2009-12-01">
[more complex stuff]
</variant>

But regarding that specific element, 'name' which one would normally assume is merely a simple string or label - in fact this is a much more complex structure which isn't a name anymore. Which brings me to another point: the HGVS string/name is a hugely semantically-overloaded construct. I can truly understand how having the label is useful when working with an LSDB, don't get me wrong. However, if that's all you provide in the XML then that places a burden on the data consumer, namely to extract the information embedded in the name according to various idiosyncrasies of naming schemes etc.

Therefore, I suggest that the coordinates and type (genomic, cDNA) information be presented separately in the XML in a structured way, along with the HGVS name, which can then be treated (by the client) as an opaque label for the variant.
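
To give a feel for the burden on the data consumer, here is a minimal Python sketch that pulls the embedded information out of a name. It is only an assumption-laden illustration: the pattern covers nothing beyond simple substitutions such as "c.34G>C", and the function and field names are invented here, not taken from any spec.

```python
import re

# Deliberately narrow: matches only simple substitutions like "c.34G>C" or
# "g.232323G>T". Real HGVS names (deletions, duplications, intronic offsets,
# protein-level names, ...) need far more than this.
SUBSTITUTION = re.compile(
    r"^(?P<type>[cg])\.(?P<pos>\d+)(?P<ref>[ACGT])>(?P<alt>[ACGT])$"
)


def parse_substitution(name):
    """Split a simple substitution name into structured fields, or return None."""
    m = SUBSTITUTION.match(name)
    if m is None:
        return None
    return {
        "coordinate_type": "cDNA" if m.group("type") == "c" else "genomic",
        "position": int(m.group("pos")),
        "ref": m.group("ref"),
        "alt": m.group("alt"),
    }


simple = parse_substitution("c.34G>C")
unparsed = parse_substitution("A447RfsX11")  # other naming scheme: not understood
```

Every consumer would have to repeat and extend this kind of parsing per naming scheme, which is the argument for shipping the structured fields in the XML next to the name.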

I agree. Important to have the info also separately.

>Implementation specific things like global ids should go into attributes, if those are needed. (?)
><variant id="lsid://findis.org/variant/00001" />

I think we should not sidestep the issue of IDs as something that will be 'dealt with' in some implementation or other. Why not put it in the spec that the globally unique identifier held in this attribute should ideally be an HTTP-resolvable URI? For instance, LRGs, which have a permanent ID, should (at some point, I hope) have a stable URI which clients can follow to look up 'useful information' (as in Linked Data):

<variant uri="http://findis.org/variant/00001">
<ref_seq uri="http://[future base URL for LRGs]/LRG000001" />
[more stuff]
</variant>
If a URI for some reason is not feasible for a given identifier, then alternatively (or additionally) provide a fully-fledged db-crossref construct containing accession AND database/namespace to go with it, something like this:

<ref_seq>
<db_xref accession="LRG000001" databaseName="LRG" />
</ref_seq>

This is pretty traditional - see e.g. DatabaseReference and Database in the Fuge model and a similar construct in PaGE.

My point here is that if I (speaking as a potential consumer/aggregator of this variant information) am given some opaque identifier without a context, I have no idea what to do with it. Variant ID "00000012"? Reference sequence LRG000003?

To counter this, I suggest explicitly stating in the spec that the variant should have (ideally) a URI (and emphasize the Linked Data connection) or, failing that, a full db_xref.
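
From the consumer side, the "URI, or failing that a full db_xref" rule could be handled roughly as in the Python sketch below. The attribute names `uri`, `accession` and `databaseName` follow the examples above; the function name `reference_of` and the example.org URL are made up for this illustration.

```python
import xml.etree.ElementTree as ET


def reference_of(ref_seq):
    """Classify how a <ref_seq> element identifies its target.

    Returns ("uri", value), ("db_xref", (database, accession)), or None when
    the element carries only an opaque identifier without context.
    """
    uri = ref_seq.get("uri")
    if uri:
        return ("uri", uri)
    xref = ref_seq.find("db_xref")
    if xref is not None:
        return ("db_xref", (xref.get("databaseName"), xref.get("accession")))
    return None


# Hypothetical URL, used only to exercise the first branch.
with_uri = ET.fromstring('<ref_seq uri="http://example.org/LRG000001" />')
with_xref = ET.fromstring(
    '<ref_seq><db_xref accession="LRG000001" databaseName="LRG" /></ref_seq>'
)
bare = ET.fromstring("<ref_seq>LRG000003</ref_seq>")

results = [reference_of(e) for e in (with_uri, with_xref, bare)]
```

The third case is the problematic one: the client gets an accession but nothing that says where to look it up.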

It is important to include the context of course (the example was illustrative), even though it is often implicitly available from the application context. It would be nice to have only one way to do that though: either URIs or db_xrefs. Can we formulate URIs for those "non-feasible" identifiers as well? ...But perhaps this doesn't matter.

Hi guys,

To throw in my 2 cents, I found Tomasz' link really useful, and propose the following adaptations / agree with the following suggestions:

- "naming_scheme" would be standardized enough in my opinion, to be an attribute of "name".

- I like the suggestion to add "uri" or "id" attributes to the "variant" element, carrying a resolvable URI.

- I believe dates / times should not be an attribute. You could simply define the structure as "YYYY-MM-DD" but a true timestamp should carry time zone information as well, making them in my opinion too extensive to be a simple attribute.

- If the "ref_seq" points to a URI, I'd prefer the format name as an added attribute, like the "database_name" that comes with an accession number, as suggested by Mummi.
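
Taken together, the suggestions in this list might produce something like the output of the Python sketch below. This is only one reading of the list: the element name `date_created` is hypothetical, the variant URI is the one from the earlier example, and the date is stored as an ISO 8601 timestamp with a time zone offset.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

# "uri" attribute on the variant element (URI taken from the earlier example).
variant = ET.Element("variant", {"uri": "http://findis.org/variant/00001"})

# naming_scheme as an attribute of name rather than a sibling element.
name = ET.SubElement(variant, "name", {"naming_scheme": "HGVS"})
name.text = "c.34G>C"

# Accession plus the database it belongs to.
ref_seq = ET.SubElement(variant, "ref_seq", {"database_name": "LRG"})
ref_seq.text = "LRG000001"

# Date as an element (hypothetical name), carrying time zone information.
created = ET.SubElement(variant, "date_created")
created.text = datetime(2009, 11, 1, 12, 0, tzinfo=timezone.utc).isoformat()

xml_text = ET.tostring(variant, encoding="unicode")
```

Note that the serializer escapes the ">" in the HGVS name, so the name survives a round trip through any XML parser.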

Thanks Ivo. I made a wiki page for the format (http://www.gen2phen.org/wiki/lsdb-xml-data-format) and put an example of the consensus format there, as I have understood it. Please feel free to edit the page. I think that once we have the basic principles right we can easily add the missing elements.

Following up on the coordinates thing - We're talking about passing around what are essentially features on some reference sequence or the other, right? Has it been investigated whether existing XML sequence feature formats and exchange protocols can be reused & extended for this purpose?

I don't mean to be a party-pooper here, but oftentimes it seems like we're getting awfully close to inventing yet another feature format....

Other interesting developments are the variant calling and alignment formats (see http://samtools.sourceforge.net/). VCF pretty much overlaps with the HGVS naming scheme. Is this something which should also be addressed at the LRG level?

Interesting. Have you tried asking in the LRG group? http://www.gen2phen.org/groups/locus-reference-genomics-lrg

The connection between VCF and LRG is not entirely clear to me. The former is a naming scheme for variants and the latter is a reference sequence format. Any naming scheme (VCF or HGVS) can choose to adopt any reference sequence format. Neither should be tied to a particular format. It's a bit like comparing apples and pears.

That said, VCF does have some interesting features that could possibly be added to HGVS. However, I did note one error in the example VCF entry provided at the 1000 Genomes site:

http://1000genomes.org/wiki/doku.php?id=1000_genomes:analysis:vcfv3.2

It states that "...that every key must have a value..." but then lists the key "DB" in data lines 1 and 3 with no value. I presume that it should say "DB=1" in both instances.

Raymond - I have to disagree with you there: VCF is not a naming scheme, it's a format for representing variant sites in the genome sequence in a structured (and rather GFF-like) way. The HGVS scheme is a system for creating handy string labels for variants, albeit labels which are semantically overloaded (see above: http://www.gen2phen.org/post/exchange-format-lsdbs#comment-165). So, another apples vs [some other fruit] comparison, I feel!

The ALT field provides a description or "name" for a variant in a similar fashion to HGVS, afaik. In principle this is just another naming scheme we can accommodate (see example above). The idea is that the details are given in the LRG entry which is referenced from the LSDB element. Alternatively, the LSDB element may also have the LRG embedded if these details need to be carried in the same message.

Juha, I can't seem to be able to find such an example. ALT in the example on the link Raymond provided contains "alternate non-reference alleles" - but that means something like "A" or "G,T", or ".". To understand the whole mutation, you'll need at least three other fields - chromosome (CHROM), position (POS), and (I believe) the reference nucleotide (REF). The strength of the HGVS nomenclature is in providing all this information in one field - although re-interpreting that information from that one field is harder to code.

Right. The ALT field has the editing instructions and the location is in another place. It is computationally simple because all you need is a piece of sequence and a position counted from the first base. It can be nailed down to a known reference sequence as well, and thus it works as a basis for a scheme to create global collaborative ids or names. The HGVS name has additional semantics which are important for clinicians, so it is better suited for creating semantic ids. The question is actually whether we should also have a place for computationally simple "editing instructions", and where that info should go.
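
To show how little is needed to apply such "editing instructions", here is a minimal Python sketch. It assumes a VCF-style triple of 1-based position, reference bases and alternate bases; the function name `apply_edit` and the toy sequence are invented for this illustration.

```python
def apply_edit(reference, pos, ref, alt):
    """Apply one edit to a reference sequence and return the edited sequence.

    pos is 1-based, counted from the first base. The ref bases are checked
    against the reference first, so an edit given against the wrong sequence
    (or the wrong position) is rejected instead of silently applied.
    """
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError("ref bases do not match the reference at this position")
    return reference[:start] + alt + reference[start + len(ref):]


toy_reference = "ACGTACGT"
substituted = apply_edit(toy_reference, 3, "G", "C")  # single-base change
deleted = apply_edit(toy_reference, 3, "GT", "G")     # deletion of one base
```

The same three fields cover substitutions, small insertions and small deletions, but say nothing about the consequence of the change, which is where the HGVS semantics come in.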

Ivo / Juha - the 'strength' of the HGVS nomenclature obviously lies in its ease of reading by a human. But what we are concerned with here is data exchange between computers and machine-readability of this information. That is why the 'editing instructions', as Juha calls them, also need to be present in structured form. So, to me it isn't a question of 'should we' build this into the variant exchange model but rather 'how'.

That's fine, but would that be additional to the HGVS nomenclature or replacing it? Because by just looking at the previously mentioned link, I cannot see how VCF can handle more complex HGVS names.
Also, if data would be shared between two systems which both store the HGVS nomenclature, it doesn't seem to make sense for them to make a translation step twice; it would be easier to import the HGVS name and forget about interpreting the other format which was generated from the HGVS name anyway. But then again, I'm not familiar with how other databases store the variant internally, and how those systems support complex variants; maybe it's easier than I think it is ;)

Ivo - I would provide both. Recipient data consumers which prefer the HGVS 'smart' name and are aware of the semantics can use that, others (myself included) would treat the HGVS name as merely a label and use the structured version of the same information.

BTW You'll find that other systems generally store this information in a structured way (BioSQL, Bio::DB::GFF and Bio::DB::SeqFeature::Store in BioPerl, Chado in the GMOD collection etc.), not the least because a major requirement in this domain is to query a genome-wide database for certain kinds of features in a specific genomic interval. Which clearly is not a requirement for LOVD, as you only store the HGVS name.

Perhaps the structured version could be optional. I understand Ivo's point as well. Also conversion can be tricky for some cases. Would be interesting to know if there are cases where HGVS cannot be mapped to VCF or vice versa.

I have recently been checking our data with Mutalyzer. It is an excellent tool and I am getting to appreciate the HGVS nomenclature even more as well. Just wondering whether it would make sense to have a simple REST API for the tool so that it can be used directly from applications. That would be useful e.g. for checking the name and getting both sequence fragments for further analysis.

My guess is that VCF can't handle big deletions, insertions, duplications or rearrangements; inversions are probably interpreted as deletions/insertions, and whole-exon deletions or duplications (even if relatively small) with unknown breakpoints in the flanking introns probably can't be stored. The other systems Mummi mentioned I have never looked into. Maybe these don't have those problems.

About the Mutalyzer API: LOVD currently uses a very basic interface to communicate to Mutalyzer. There are already plans to update it and improve it, but I really have no clue what the timeline would be.

Should these issues be tested with the VCF developers? That might be useful for them as well.

Good news that there is an API coming! Is that something where gen2phen could help somehow? Just keen on getting my hands dirty with it asap ;-). BTW I am currently using Mutalyzer to get the reference and mutated sequences, and from there going to a web app (http://rapid.rcai.riken.jp/mutation/) to see the effects and match those with our annotations.