lsdb xml data format
Background
Implementation
- RelaxNG compact and XSD schema files and examples from the svn
- Documentation (see also examples). A new document and web site are under way
- UML model for XML schema (see other model versions here), TextUML Eclipse project and alignment to PaGE-OM
- Controlled vocabulary terms
- Minimal requirements
- XML schema version announcements/version log (post)
Examples
- See the svn repo
- Version 1.2 (old version 1.1)
- An AIREBase example (v1.2) and original flat file sample
- Old (initial) version
Summary discussions (note that the format examples are old)
Implementation examples (old format)
- Findis web-service : AMT, AGU (Use view source in your browser since results are in XML). Patient information is replaced by populations

Comments
I've made some changes to the format (please undo if necessary):
The variant id of LOVDs is numeric; more entries on the same DNA change are allowed and can contain other (predicted) effects on RNA or protein level. To reflect this, I changed the suggested IDs, also emphasizing that it will point to an LOVD installation, not to the LOVD website.
Because the variants no longer carried any actual information after the ID change, I added <name> tags with the DNA change, as in the previous version.
Also, to avoid ambiguity with the variant's <name>, I changed the "name" attribute of the <variant> tag to "scheme".
Thanks. Looks better now. It is also good to separate (non-semantic) IDs from the variation names. How about that dbxref in refseq? Should we use URIs for those as well, as Mummi suggested?
Quick note on the variant change string: in an XML representation of a variant you'll need to escape special symbols like > and <, for instance c.342G>C = c.342G&gt;C
There are surely convenience functions in PHP for doing the escaping for you, e.g. like this in Perl using the CGI module ($escaped_string = escapeHTML("c.342G>C") ). On the receiving end, the client then un-escapes the string to get the original variant change.
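For clients written in Python, the same round trip can be sketched with the standard library. This is only an illustration of the escaping step described above; the <name> element is taken from the format under discussion and the variable names are made up for the example.

```python
from xml.sax.saxutils import escape, unescape
import xml.etree.ElementTree as ET

variant = "c.342G>C"

# Manual escaping, as a server might do when writing XML by hand.
escaped = escape(variant)      # "c.342G&gt;C"
restored = unescape(escaped)   # back to the original string

# An XML library does the same work implicitly when serialising and parsing.
name = ET.Element("name")
name.text = variant
serialised = ET.tostring(name, encoding="unicode")
parsed_back = ET.fromstring(serialised).text
```

When the XML is produced and consumed through a parser library, no manual escaping or un-escaping is needed on either side.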
Maybe it would be a good idea to include an URL to the reference sequence, yes... Not sure if we should also support non-official (non-submitted) reference sequence files that people put on the web, though.
But I think it shouldn't replace the ID and database_name. It should be an addition to those two fields. Or is that a symptom of me not wanting to let go of simple non-URL IDs just yet?
Mummi - yes, you're right. PHP has the htmlspecialchars() or htmlentities() for that (the latter has an antagonistic function).
Juha, good improvements on the format! Some new elements raise some questions, though:
<phenotypes> is not so easy to generate from LOVD. LOVD stores the phenotype information of the patient (possibly more than one disease) on the patient side. If multiple variants are found - actually with any number of variants found - LOVD can not be *sure* which variant causes which phenotype, even if I take into consideration the pathogenicity field, which stores the predicted and concluded pathogenicity of the variant. So maybe we should generate a patient element? Or maybe this field is meant differently?
LOVD does not have a field for <consequences>.
What does <phase> store? Doesn't seem like something LOVD stores by default, either.
The elements on the variant detection should be structured differently, I think. In LOVD 2.0 it is structured exactly like this, but because of various reasons we will structure it differently in LOVD 3.0. Now, you can't store separately if the mutation has been confirmed later with a different method or using a different template (like check on RNA or Protein level). You could group values like in LOVD 2.0 (template: DNA, RNA; technique: SEQ, RT-PCR) but that's still somewhat limiting. Perhaps we would store it similar to the <publications> part.
<variant><source>: what does this store? The source of the consequence?
Also, I realised that the <name> element does not indicate the level (DNA, RNA, Protein). Should we include that? We could just parse the first character of the field to find the answer, too...
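As a rough sketch of the "parse the first character" idea, assuming the <name> content follows HGVS-style prefixes (c., g., m., n. for DNA, r. for RNA, p. for protein). The function name and the mapping are illustrative and not part of the format:

```python
# Assumed mapping from HGVS-style prefix letter to the level discussed above.
HGVS_PREFIX_TO_LEVEL = {
    "c": "DNA", "g": "DNA", "m": "DNA", "n": "DNA",
    "r": "RNA",
    "p": "AA",
}

def infer_level(name):
    """Return "DNA", "RNA" or "AA" for a variant name, or None if unknown."""
    name = name.strip()
    # A reference sequence such as "NM_004006.2:" may precede the change.
    if ":" in name:
        name = name.split(":", 1)[1]
    # Expect a single prefix letter followed by a dot, e.g. "c." or "p.".
    if len(name) < 2 or name[1] != ".":
        return None
    return HGVS_PREFIX_TO_LEVEL.get(name[0])
```

This works only as long as every name carries a prefix; an explicit attribute would remove the guesswork for names that do not.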
Thanks Ivo for the comments! I have added a patient element which has the phenotypes plus other items from the requirements list you sent. Is this OK?
About the consequence: do you have any information on what the mutation does, such as frameshift, altered splicing, etc.?
Phase was meant for cases where we have more than one mutation from the same patient and want to state whether the mutations are on the same chromosome (same gene but different allele). Do you have this kind of info? Perhaps we can remove the element.
BTW, MUTbase seems to have both alleles if data is available. See for example: http://bioinf.uta.fi/LIG1base/?content=pub/IDbases. Do you have this kind of data also?
Source tells how the variation was observed, i.e. in this case computed (using Mutalyzer). It is not good to implement it like that; thanks for pointing this out. I have to think about that and the detection template issue more. Perhaps "computational" can be stored in the detection technique element?
Should we have a type element? DNA, RNA and AA?
...we should take the patient out of the variant entry if we have more than one mutation per patient in our databases:
<patient id="http://example.org/LOVD/patient/xyz">
<gender>male</gender>
<phenotypes>
...
</phenotypes>
<variants>
<variant id="http://example.org/LOVD/variants/00000001">
...
</variant>
<variant id="http://example.org/LOVD/variants/00000002">
...
</variant>
</variants>
</patient>
</variation_data>
or
<patient id="http://example.org/LOVD/patient/xyz" >
<gender>male</gender>
<phenotypes>
...
</phenotypes>
</patient>
<variant id="http://example.org/LOVD/variants/00000001" >
<patient ref="http://example.org/LOVD/patient/xyz" />
...
</variant>
<variant id="http://example.org/LOVD/variants/00000002" >
<patient ref="http://example.org/LOVD/patient/xyz" />
...
</variant>
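To get a feeling for what the second layout means for a client, here is a minimal sketch that resolves the patient references back to patient records. It assumes the reference element is an empty <patient ref="..."/> and that everything sits under a <variation_data> root; both are assumptions for illustration only, as is the function name.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using the flat layout with references.
SAMPLE = """
<variation_data>
  <patient id="http://example.org/LOVD/patient/xyz">
    <gender>male</gender>
  </patient>
  <variant id="http://example.org/LOVD/variants/00000001">
    <patient ref="http://example.org/LOVD/patient/xyz"/>
  </variant>
  <variant id="http://example.org/LOVD/variants/00000002">
    <patient ref="http://example.org/LOVD/patient/xyz"/>
  </variant>
</variation_data>
"""

def variants_by_patient(xml_text):
    """Group variant ids under the id of the patient they refer to."""
    root = ET.fromstring(xml_text.strip())
    patients = {p.get("id"): p for p in root.findall("patient")}
    grouped = {patient_id: [] for patient_id in patients}
    for variant in root.findall("variant"):
        ref = variant.find("patient")
        if ref is None or ref.get("ref") not in patients:
            raise ValueError("variant %s has no resolvable patient" % variant.get("id"))
        grouped[ref.get("ref")].append(variant.get("id"))
    return grouped

grouped = variants_by_patient(SAMPLE)
```

With the first (nested) layout this lookup step disappears, at the cost of making the export patient-based rather than variant-based.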
Hi Juha!
Firstly, apologies for taking so long. I will take up your suggestion to create a separate page on the mapping of the LSDB minimal requirements to LOVD 2 and the XML format. I will probably post that tomorrow.
First, on the XML (I've made some minor markup fixes to the specs).
I printed the part in italics (ethnicity and geographical_region) that is now duplicated: once within the patient element and once within variant. I think the part in italics can be removed?
The new patient element makes much sense, but as you say we need to think about how to structure it. Actually, in the RESTful LOVD API everything will be separated in the Atom feed. So LOVD will output the Atom feed, with every entry containing a <variant> element. In that case, it would make sense to contain the <patient> element in the <variant> element (as it is now). Also, this is how the current LOVD flat text import file looks: patient info is repeated for every variant entry in that patient.
About the consequence; no we don't. The RNA/Protein level notations tell it all. We do have a field that can be enabled by users that lists the variant type on DNA level. But also here: it's basically included in the DNA field anyway.
About phase: but what would you fill in there, when one patient has three variants found; 2 on one allele, and 1 on the other? There is no "yes/no" case here. LOVD uses the Allele field for that (parental_origin in the XML). If that is the same for two variants, then those variants are in cis. Otherwise, in trans. That is, if Allele doesn't contain something like "Unknown".
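The rule in the previous paragraph can be written down as a small sketch. The allele values used here ("Parent #1", "Parent #2", "Unknown") and the function name are assumptions for illustration; the actual contents of the Allele field depend on the LOVD installation.

```python
from itertools import combinations

def phase(allele_a, allele_b):
    """Return "cis", "trans" or "unknown" for two variants of one patient."""
    a = allele_a.strip().lower()
    b = allele_b.strip().lower()
    # Without a known allele on both sides nothing can be concluded.
    if a in ("", "unknown") or b in ("", "unknown"):
        return "unknown"
    return "cis" if a == b else "trans"

# One patient with three variants: two on one allele, one on the other.
# The variant labels are placeholders, not real data.
alleles = {"variant 1": "Parent #1", "variant 2": "Parent #1", "variant 3": "Parent #2"}
pairs = {
    (x, y): phase(alleles[x], alleles[y])
    for x, y in combinations(sorted(alleles), 2)
}
```

The pairwise result covers the three-variant case that a single yes/no phase element cannot express.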
The example you show for MutBase more or less represents our patient view:
Homozygous:
http://www.dmd.nl/nmdb2/variants.php?select_db=CAPN3&action=view&view=00...
Heterozygous, trans:
http://www.dmd.nl/nmdb2/variants.php?select_db=CAPN3&action=view&view=00...
Heterozygous, cis:
http://www.dmd.nl/nmdb2/variants.php?select_db=CAPN3&action=view&view=00...
But an export like that would be patient-based... instead of variant-based...
About source: I don't think it should be in detection technique either. Computers don't find the variants. They might classify them, though. Mutalyzer (or SIFT, Align GVGD, UMD predictor, etc) might be the source of a suggestion of the effect of the variant, but the variant itself has been found with (Next generation) sequencing, RT-PCR, Melting curve analysis, etc. LOVD doesn't store it, but if you would want to include it in the XML I think it should go somewhere near consequence.
About the variant type element - do you think we could do:
<name type="DNA" scheme="HGVS">c.755G&gt;A</name>
instead of making seq_type a different element? If we want to replace <name> with more structured data like Mummi suggested, then we should keep the seq_type element I think...
Pfew... still lots of things, but I think we are getting somewhere :)
Thanks Ivo for good comments!
I have removed the duplicate elements.
I understand that you would like to keep the "mutation centric" implementation for "mutation reports". Could this be turned into a new element? I.e. we could have a mutation_report or variant_report element which holds the patient and mutation info:
<variant_report id="http://lovd.org/idxxxx">
<patient id="http://lovd.org/patient/xyz" >
<gender>male</gender>
<phenotypes>
... ...
</phenotypes>
</patient>
<variants>
<variant id="00000001">
...
</variant>
<variant id="00000002">
...
</variant>
</variants>
</variant_report>
Is this OK?
Should we remove the consequence or make it optional? It is, as you said, RNA/AA/DNA specific, and consequences are included in the variant elements that are in the seq_change (related sequence changes) element.
Phase can then be removed and we use only the parental origin element, which has at least the following values
(which are actually from an email from Johan that I got after the workshop):
0 = Unknown
1 = Parent #1
2 = Parent #2
10 = Paternal (inferred)
11 = Paternal (confirmed)
20 = Maternal (inferred)
21 = Maternal (confirmed)
Possible new values:
3 = de novo
13 = de novo, on paternal allele
23 = de novo, maternal allele
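As a small sketch, the code list above could be kept as a lookup table on the client side, with the three possible new values held separately until they are agreed on. The names below are made up for illustration.

```python
# Values listed above for the parental origin element.
PARENTAL_ORIGIN = {
    0: "Unknown",
    1: "Parent #1",
    2: "Parent #2",
    10: "Paternal (inferred)",
    11: "Paternal (confirmed)",
    20: "Maternal (inferred)",
    21: "Maternal (confirmed)",
}

# Possible new values; kept apart because they are only a proposal.
PROPOSED_PARENTAL_ORIGIN = {
    3: "de novo",
    13: "de novo, on paternal allele",
    23: "de novo, maternal allele",
}

def describe_origin(code, allow_proposed=False):
    """Return the label for a parental origin code, or None if not accepted."""
    if code in PARENTAL_ORIGIN:
        return PARENTAL_ORIGIN[code]
    if allow_proposed:
        return PROPOSED_PARENTAL_ORIGIN.get(code)
    return None
```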
Perhaps we remove the source and use detection_template, if needed, in cases where the variation is predicted using computational techniques. The "template" is then "predicted" or something similar. Variant elements under the seq_change element can have this attribute as well (meaning that we have the necessary annotation on all levels).
Looks better to have seq_type as an attribute as you wrote.
Hi Juha!
I think we might need to check the use cases on this. Do we have a list of things it is meant for? What is the purpose of this exchange format? Is it for LSDBs, to share (download/import) data with each other? Or for Gen2Phen databases (incl. DMuDB) to share (download/import) information with each other? Or for whatever client wants to access LSDBs through the API, for querying purposes only? Should it be possible for all LOVD data to be put in there, because maybe I should be able to import it into another LOVD installation? I probably should've asked this before, as I am not familiar with the details for deliverable 3.7, but we're getting to a point where we're asking if the XML should be variant-centered or patient-centered... I think that's a rather important thing to have consensus on from the beginning.
So when reading my answers below, please keep the above in mind ;)
variant_report could work, but I think it's better to have just a bunch of <variant> elements in the <patient> element instead of grouping both in a <variant_report> element. It saves us one more defined element.
Consequence: it depends on the implementation. Some systems will store it (or want to receive this information), but many LOVDs do not store it. So, my vote is for making it optional, or removing it (depending on what this XML format is for).
Yes, phase can be removed then; we have parental_origin. In the list you present, values 3, 13 and 23 (the bottom three values) are not in LOVD 2.0.
On source: can you give me an example of when you use computational techniques for predicting variants? I thought you meant predicting the effect, but then it's really different from detection_template and detection_technique. They are meant to describe how the variant (mostly on DNA level) was found in the patient, so really lab-related information, not to predict the effect on protein level. The latter is more closely related to the aforementioned consequence element.
Thanks Ivo. Good to recheck the use cases. Perhaps we should collect use cases on a wiki page? Would you like to make a start? We could then prioritize the cases.
A variant report might be useful if it has other info like submitter, contact details etc., which makes it meaningful. But let's get those use cases first before we dig into this. The same goes for the other issues.
I meant cases where you simply look at the DNA sequence and go from there to the possible protein sequence and structure. This is a sort of prediction, right? Perhaps it is better to keep things separate, as you said.
Hi, I put an outline page on the wiki regarding use cases. Hopefully we can fill this out with more use cases and more details as time goes on. I think there are a few use cases where we would want a complete set of LSDB data to fit into the exchange format. For instance archiving, backup and restoration of complete LSDBs. We have a deliverable on WP7 about this.
Ok, I've adapted the format to some things we agreed on already and some last changes which are more or less suggestions by me.
- I removed <phase> as we both agreed.
- I moved <seq_type> to the attribute "type" in <name>. Possibly, we would want to make a difference between c. and g. contents; so possible values could be DNA/RNA/AA or DNA/cDNA/gDNA/RNA/AA.
- I restructured the <detection_technique> and <detection_template> to something more standard that would also fit in a larger XML model. Within <variant_detection> there is more space for multiple techniques with more data attached (such as protocols).
- I moved the element <ref_seq> up to right beneath the <name> just like the child <variant> elements.
- I moved the <patient> element down, to group all the variant's elements.
That's it... two more questions and two remarks:
- Is there (should there be?) a relationship between the <consequences> tag and the <seq_change> element?
- Should we try to indicate the relationship between the <consequence> and the <source> elements in the variant/seq_change/variant element?
- LOVD does not store the <consequence> field, so it should be optional.
- I think we should fix that the main <variant> element has a <consequences> element with <consequence> elements, but the "children" <variant> elements have only a <consequence> element.
Thanks for the update. The modifications look good to me.
I was thinking that variant elements under <seq_change> also have consequences and that this is the only way to annotate consequences on different levels. You may have different use cases; see the last comment.
We also need a mechanism to flag gene products in case of splice variants. In MUTbase this is done by rnalink and dnalink attributes (see http://bioinf.uta.fi/LIG1base/?content=pub/IDbases) AFAIK. Also, if we have that option then the <source> element can be interpreted correctly.
OK for having multiple consequences on top level and one consequence on child elements... perhaps because people may have many "consequences" in some variants without the <seq_change> child components ? Or are there other reasons? If not, then we do not need to refer from the consequences to <seq_change> child elements (your first question) if I have understood this correctly.
The Mutation effect field in LOVD looks pretty much like what I meant by the consequence. It has values like frameshift, in-frame deletion and so on. Do we have a controlled vocabulary for these terms already?
LOVD also has a type of mutation field, which has values like insertion, deletion etc. A controlled vocabulary for these terms would be useful as well, at least for query purposes. What do others think?
E.g. dbSNP has the following types (or classes):
Strict single nucleotide polymorphism or "SNP"(1)
Insertion/deletion variation or "in-del"(2)
Unclassified heterozygous variations or "heterozygous"(3)
"Microsatellite"(4)
Named variation without allele sequence or "named-locus"(5)
"No variation" record(6).
"Mixed" variations(7)
"Multinucleotide polymorphism" (8)
I'm not sure anymore if I'm getting the <consequence> thing... Is the <consequences> element of the top <variant> element a summary of the <consequence> elements of the children <variant> elements? Is that why the top element is grouped, and the child elements are not? Maybe I don't understand because I don't understand the example given in the XML sample file "not enough space for E (...)".
In my head, the child <variant> elements can contain just as much data (and it should be in the same format) as the main <variant> entry (except the <patient> element), otherwise you'll end up with different meanings of the <variant> element. Or am I off, here?
Yes, it's a good idea to link protein level changes to RNA level changes. Linking RNA changes to DNA changes is unnecessary in this XML format, as the RNA change (a <variant> element within <seq_change>) will be a child of the DNA change (the main <variant> element).
However, because one DNA change can give rise to two separate RNA changes, there will be two protein changes; one for each RNA change. It's a very good idea to link those protein changes to the causing RNA changes. However, LOVD does not store this connection directly and cannot provide this information with 100% reliability.
Maybe we can introduce this link by putting the protein <variant> element as a child in <seq_change> of the RNA change?
You wrote about a "Mutation effect" field. Unfortunately, it's not a standard LOVD column. So a user from the LOVD you saw this in, has created this column by himself.
LOVD does have a Variant/Type column by default (not enabled by default, though), that contains the type of variant on DNA level (Substitution, Deletion, Duplication, Insertion, Inversion, Insertion/Deletion, Translocation, Other/Complex). It's a very simple column and thus does not have a controlled vocabulary. In fact, people may edit this column to put whatever kind of values in there. In principle, what kind of mutation it is can be seen from the DNA field itself. But I agree that for query purposes it's nice to have this data. Actually, since LOVD 2.0-23 LOVD also stores this data when mapping a variant to the genome. At least for the first 6 types. Translocations and Other/Complex can't be mapped and we don't have a type for those variants.
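To illustrate the point that the type can in principle be read from the DNA field itself, here is a rough sketch that assumes HGVS-style descriptions. It uses plain substring checks and is not a validating parser; the function name is made up, and translocations and complex cases are simply reported as None.

```python
def dna_variant_type(description):
    """Guess the DNA-level variant type from an HGVS-style description."""
    # Drop an optional reference sequence prefix such as "NM_004006.2:".
    change = description.split(":", 1)[-1]
    # Covers both "delins" and the older "del...ins..." notation;
    # must be tested before the separate "del" and "ins" checks.
    if "del" in change and "ins" in change:
        return "Insertion/Deletion"
    if "del" in change:
        return "Deletion"
    if "dup" in change:
        return "Duplication"
    if "ins" in change:
        return "Insertion"
    if "inv" in change:
        return "Inversion"
    if ">" in change:
        return "Substitution"
    # Translocation and Other/Complex are not recognised by this sketch.
    return None
```

A stored Variant/Type column remains more convenient for querying, since a database cannot easily run this kind of check inside a search.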
The dbSNP list has some issues; an "in-del" can also be a "Multinucleotide polymorphism" and many types are missing; deletions, insertions, inversions, etc.
For the variation type, can we not use the Sequence Ontology (http://www.sequenceontology.org)? It is widely used for genome annotations. There's some coverage for variant effect as well.
I recall a brief discussion on this over e-mail, last year I believe, with no real conclusion I think. Sure, there might be missing or incorrect things in there in places, but who better to help improve it than LSDB experts?
Thanks for telling me. The example was perhaps a bit bad. It's a consequence on protein structure level, taken from our db. I agree with your definition of "child" variants. They should have the same (or a subset of the) attributes with the same definitions.
Nesting of variants is a good idea. I was thinking of grouping as a way to do this. Here is an example of two groups of RNAs and AAs:
<seq_change>
<variant type="RNA" group_id="1" >
...
</variant>
<variant type="AA" group_id="1" >
...
</variant>
<variant type="RNA" group_id="2" >
...
</variant>
<variant type="AA" group_id="2" >
...
</variant>
</seq_change>
But let's take the nesting approach!
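A minimal sketch of what the nesting approach could look like to a client: the protein <variant> sits inside the <seq_change> of the RNA <variant> that causes it, so the link is given by the tree itself. The id values and the placement of the type attribute on <variant> are assumptions for this example only.

```python
import xml.etree.ElementTree as ET

# Hypothetical nested layout: one DNA change, two RNA changes,
# each RNA change carrying its own protein change.
NESTED = """
<variant id="dna-1" type="DNA">
  <seq_change>
    <variant id="rna-1" type="RNA">
      <seq_change>
        <variant id="aa-1" type="AA"/>
      </seq_change>
    </variant>
    <variant id="rna-2" type="RNA">
      <seq_change>
        <variant id="aa-2" type="AA"/>
      </seq_change>
    </variant>
  </seq_change>
</variant>
"""

def walk(variant, parent_id=None):
    """Yield (id, type, parent id) for a variant and all nested variants."""
    yield (variant.get("id"), variant.get("type"), parent_id)
    seq_change = variant.find("seq_change")
    if seq_change is not None:
        for child in seq_change.findall("variant"):
            yield from walk(child, variant.get("id"))

links = list(walk(ET.fromstring(NESTED.strip())))
```

Compared with group_id, no extra attribute is needed and a protein change cannot end up in a group without an RNA change.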
Mutation effect is from the Leicester "Osteogenesis Imperfecta" database. It would be nice to make a survey of what kind of fields people have and then try to systematize those fields. The Mutation effect field looks equivalent to our consequence field.
The dbSNP case illustrates the importance of having a proper ontology behind this. I hope end-users will give us feedback on how important these annotations are.
Thanks Mummi for pointing this out. Bad memory...
Ok Juha, I have updated the XML schema to reflect the nesting we agreed on.
Sequence Ontology may be a nice addition, I don't know. I briefly went through it but I don't have the time right now to really dive into it. If someone finds the time to see how it can be incorporated into this before the end of December, great. Unfortunately I cannot be of much help anymore, since two days from now I will be on symposium/holiday break until 5th of January. I will try to read my email and be active, but I cannot make any promises.
Thanks Ivo for the update and have a nice break! ... but if you still have time, could you give your opinion on having an explicit location on the reference genomic sequence or on multiple reference sequences (e.g. different sequence builds)? This optional info could be helpful, e.g. when visualizing the data.
I second the idea of using SO to describe variation types. As a result of last year's discussion, unqualified terms like 'mutation' and 'polymorphism' were changed in favor of 'sequence_variant'. So if some terms need adding or changing, the SO dev team is quite responsive.
Here's a list of terms that could be of use:
http://www.ebi.ac.uk/ontology-lookup/termSearch.do?ontologyName=SO&inclu...
I'm happy to help with mapping a controlled vocab list to SO if necessary.
Same. I made a wiki page for vocabulary terms: http://www.gen2phen.org/wiki/lsdb-controlled-vocabulary-terms
Ivo, can you dig out what kind of values you have for pathogenicity?