LSDB minimal requirements (D3.4), compared to LOVD and the LSDB XML format

See the current XML format: http://www.gen2phen.org/wiki/lsdb-xml-data-format and workshop notes: http://askja.gene.le.ac.uk/drupal5/content/lsdb-minimal-requirements

This list is a mapping between the LSDB minimal data requirements as listed in deliverable 3.4, LOVD 2.0/3.0 and the XML format currently under development (see link above). As the XML format will be developed further, this list will be updated.
Last update: 2010-01-21.

Name | D3.4 says | Availability in LOVD 2.0 | Availability in LOVD 3.0 (planned) | Cafe Rouge | Element or attribute currently in XML format
Variant/Exon | Recommended | Standard, may be removed | Standard, may be removed | Recommended | variant/exon
Variant/DNA_genomic | Obligatory | Not available, but genomic position can automatically be generated | Always available (generated) | (Obligatory)* | variant/name
Variant/DNA_coding | Recommended | Always available | Always available | (Obligatory)* | variant/aliases/variant/name
Variant/RNA | Obligatory | Always available | Always available | Recommended (might not be available) | variant/seq_change/variant/name
Variant/Protein | Obligatory | Always available | Always available | Recommended (might not be available) | variant/seq_change/variant/name
Variant/DBID | Obligatory | Always available | Always available | Should be generated by LSDB | variant/id
Variant/Reference | Obligatory | Not standard LOVD; can be derived from Patient/Reference | Undecided | Obligatory | variant/ref_seq
Variant/DNA_published | Recommended | Sometimes available | Sometimes available | Recommended | variant/aliases/variant/name
Variant/Detection/Template | Obligatory | Not standard LOVD; can be derived from Patient/Detection/Template | Always available; as Screening/Template | Obligatory | variant/variant_detection/template
Variant/Detection/Technique | Obligatory | Not standard LOVD; can be derived from Patient/Detection/Technique | Always available; as Screening/Technique | Obligatory | variant/variant_detection/technique
Variant/DNA_remark | Recommended | Compatible with Variant/Remarks which is sometimes available | Sometimes available; as Variant/Remarks | Recommended | variant/comment
Variant/Frequency | Recommended | Standard, may be removed | Standard, may be removed | Recommended | variant/frequency
Variant/Origin | Recommended | Not standard LOVD, can only partially be derived from other (optional) columns | Undecided | Recommended | variant/origin
Variant/Restriction_site | Optional | Standard, may be removed | Standard, may be removed | Optional | variant/restriction_site
Variant/Allele | Recommended | Always available | Undecided | Recommended | variant/parental_origin
Variant/Pathogenicity | Recommended | Always available | Undecided | Recommended | variant/pathogenicity
Patient/Patient_ID | Obligatory | Non-public information | Non-public information | Obligatory | patient/original_id
Patient/Phenotype/Disease | Obligatory | Always available | Always available | Obligatory | patient/phenotype
Patient/Remarks | Recommended | Standard, may be removed | Standard, may be removed | Recommended | patient/comment
Patient/Origin/Geographic | Recommended | Sometimes available | Sometimes available | Recommended | patient/population (type="region")
Patient/Origin/Ethnic | Recommended | Sometimes available | Sometimes available | Recommended | patient/population (type="ethnic")
Patient/Gender | Recommended | Sometimes available | Sometimes available | Recommended | patient/gender
ID_submitterid | Obligatory | Sometimes available (field can be empty, which means the curator is the submitter) | Always available | Obligatory | source/id
Variant/HGNC gene Symbol Obligatory Obligatory variant/gene/accession and source="HGNC"
Variant/Sharing policy (public/private) Obligatory variant/sharing_policy (since release 1.4)
Variant/Use permission (default Creative Commons 0) Obligatory variant/use_permission (since release 1.4)
           
    Legend 
    Always available: Needs modification of LOVD to allow removal
    Standard, may be removed: Is enabled by default but users are allowed to remove these columns 
    Sometimes available: Is available in LOVD but not enabled by default; users can activate these columns 
    Not standard LOVD: Some LOVD installations (especially Leiden-based) have these columns
    Undecided: May be same or similar as in LOVD 2.0, but we haven't decided on the exact implementation yet.
    (Obligatory)*: One of these fields needs to be present.
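
As an illustration of how the Cafe Rouge column could be checked in software, here is a minimal Python sketch. The field names are taken from the table above; the function name `missing_fields`, the dict-based submission layout and the decision to treat empty strings as absent are assumptions made for this example only and are not part of LOVD or the XML format. The three rows added later (gene symbol, sharing policy, use permission) are left out.

```python
# Fields marked "Obligatory" in the Cafe Rouge column of the table.
OBLIGATORY = [
    "Variant/Reference",
    "Variant/Detection/Template",
    "Variant/Detection/Technique",
    "Patient/Patient_ID",
    "Patient/Phenotype/Disease",
    "ID_submitterid",
]

# Fields marked "(Obligatory)*": at least one of them must be present.
ONE_OF = ["Variant/DNA_genomic", "Variant/DNA_coding"]


def missing_fields(submission):
    """Return a list of problems found in a submission dict (hypothetical layout)."""
    def present(name):
        value = submission.get(name)
        return value is not None and str(value).strip() != ""

    # Every plainly obligatory field that is absent or empty is reported by name.
    problems = [name for name in OBLIGATORY if not present(name)]
    # The starred group is reported once, and only if none of its members is given.
    if not any(present(name) for name in ONE_OF):
        problems.append("one of: " + ", ".join(ONE_OF))
    return problems
```

An empty list means the submission satisfies the obligatory part of the column; recommended and optional fields are deliberately not checked.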

 


Comments

Hi,
would it be possible to add a column for Cafe Rouge? It would allow us to reach a consensus on what is obligatory and recommended for this use case and progress on a functioning implementation.
I would see the following:
Variant/Exon Recommended
Variant/DNA_genomic Recommended *can be generated by LSDB*
Variant/DNA_coding Recommended
Variant/RNA Recommended *might not be available*
Variant/Protein Recommended *might not be available*
Variant/DBID *Should be generated by LSDB*
Variant/Reference Obligatory
Variant/DNA_published Recommended
Variant/Detection/Template Obligatory
Variant/Detection/Technique Obligatory
Variant/DNA_remark Recommended
Variant/Frequency Recommended
Variant/Origin Recommended
Variant/Restriction_site Optional
Variant/Allele Recommended
Variant/Pathogenicity Recommended
Patient/Patient_ID Obligatory
Patient/Phenotype/Disease Obligatory
Patient/Remarks Recommended
Patient/Origin/Geographic Recommended
Patient/Origin/Ethnic Recommended
Patient/Gender Recommended
ID_submitterid_ Obligatory

Depending on the submission software at least one of:
Variant/DNA_genomic
Variant/DNA_coding
Variant/RNA
or Variant/Protein
should be included in the submission, what do you think?

Hi David, all of these attributes are included, unless I have forgotten something by mistake. We are happy to add more if needed. Do you have more attributes in your database? Please do not hesitate to include those.

David, I have added a column to this table using the information you provided, but now the end of the table seems to be missing because it has become too wide, so the table is not as readable as it was. I will see what happens if I make the font smaller.

I personally would frown if a laboratory only tries to detect mutations on RNA or protein level and not on DNA level.
Also in LOVD, DNA is absolutely mandatory. If really needed, the HGVS schema can show the variant name is just predicted, like: c.(1234C>G)?
So I would say: include DNA (whichever one), RNA and Protein in the submission.

Hi Juha: we do not want to add more than there is already in LOVD, the idea behind adding a column was to clearly show what the Cafe Rouge platform is expecting (obligatory and recommended)

Hi Ivo: thanks for adding the column. I agree on the obligatory part for at least one of the DNA (either c. or g.) (is there a way to describe this in the table?)
For the protein: is it always possible to predict the protein change? What happens if the breakpoints are not clearly identified in rearrangements or indels?

Hello David! The table is used also for the XML format which goes beyond the Cafe Rouge. For the purposes it is good to add as much as possible, or at least to know what kind of data there are. It would help us to evaluate the format and possibly add new elements.

We would like to use e.g. the variation part in different contexts. For example, the format now works for national mutation databases as well (or at least for the findis), because I added a place for the gene name (database cross reference).

Other new features are evidence_code, asked for by Mauno, and the source element, which gives information on data sources. I have committed the new version to the svn: http://www.gen2phen.org/post/lsdb-xml-schema

I have added missing bits into the table.

Hi Juha,
agree, we should add as much as possible to the format (but not as mandatory): the Cafe should act as an intermediate between as many as possible agents. I am just trying to reach a consensus on a minimal list of mandatory elements that diagnostics labs would submit to the LSDB 'world' Johan is working on creating.

Excellent! Good to get the info from the diagnostic lab side as well! We will keep most of the attributes optional.

Hi David: I've tried describing your suggestion in the table, using (Obligatory)* - I hope it's clear this way.
When neither RNA nor Protein has been analyzed, and the exact breakpoints are unknown (like almost all entries in the DMD whole-exon changes database), Johan describes them like: "p.(fsX)" or even "p.(?)". It may not seem like much, but it's better than no value, because the values I just mentioned may only indicate a prediction, but they at least show the Protein has not been analyzed.

Hi Juha: Thanks for your additions to the table. I have one question though: "Patient/Patient_ID" now is the "variant/local_id" XML element. Shouldn't that be in the patient element?

Hi Ivo,
you are right for p.: it's good to confirm that the protein has not been analyzed using "p.(?)"
If we all agree here on the 'Cafe Rouge' column content as it is right now, we should freeze it by end next week and get the other partners to agree to it as well through the science mailing list.

Thanks and fixed!

Hi All,

What's the intention with respect to the obligatory fields Variant/Detection/Template and Variant/Detection/Technique? In some instances, splicing defects may be apparent after analysis of mRNA that has been reverse transcribed, PCR amplified and subjected to agarose gel electrophoresis. However, the underlying variant leading to altered splicing is not found until genomic DNA is analysed by PCR amplification and DNA sequencing. In such a case, is the template mRNA, genomic DNA, or both? Also, no single technique has been used. Does the schema allow for this?

Hello Raymond. Thanks for the question. The format should be able to handle that, if I understood you correctly, because the detection information can be added into all sequence levels. One thing we could add more is reference to detection protocol details, if that is needed.

Any chance we could color code this mapping to show required and optional and no-option field, and also to make conflicts apparent?

Also, the CaFE RouGE model has been updated recently, but not sure if that is represented in the sheet (Owen?)

 

 

  • Table updated. Unfortunately the colors do not show for some reason.
  • I still urge implementing the format ;-) That is the best way to find conflicts or other errors.
  • We currently have examples from: DMuDB and AIREbase
  • Tony, CaFE RouGE should use the same format embedded into an atom feed, unless I am mistaken...
Here's the core of a possible solution to limitations of this standard, as discussed in a face-2-face meeting between me (Tony), Juha and Myles on 1st July 2010...

* Keep just one 'Variant' class but add/adapt the following new attributes:
 
- Add 'AllelicOrGenotypic' which can have values "Allelic" or "Genotypic"
This will discriminate between these two forms of variant
It will also need a self-recursive relationship to capture allelic variant to genotypic variant relationships
 
- Add 'DiploidCount' which can have any numeric value (not just integers!)
For allelic variants, value 1 would equate to heterozygosity, value 2 to homozygosity, value 3 to trisomy, etc, and value 1.5 to a situation in a cancer sample where half the cells have lost one allele [so we would not put values like heterozygosity in the 'Origin' field].
For genotypic variants this same field would provide a way to capture the count of copy number variant.
 
- Rename 'Origin' to 'Genetic Origin' for values such as 'Unknown', 'de novo (certain)', 'de novo (inferred)', 'from mother (certain)', 'from mother (inferred)', 'from father (certain)', 'from father (inferred)', 'from either parent'  ...to precisely capture the genetic origin [and nothing more!]

* Have a distinct 'Pathogenicity' (rather than having this as an attribute of Variant), joined to Variant by a many-many relationship. It should have at least these attributes:
 
- 'Inferential Scope' to specify what set of individuals it refers to.
Values should include terms such as 'Individual', 'Family', 'Population', 'Population XYZ', 'Ethnic group', 'Ethnic group XYZ'
This should also be a Required field, as it is key for interpreting and integrating pathogenicity statements
 
- 'Evidence Type', linked to an ontology such as Vario
 
- 'Evidence Text', free text describing the evidence

* Have an 'Observation_Target' superclass, with subclasses 'Individual' and 'Panel', with a many-many association between these two. Obviously, there is also a need for a many-many relationship between Observation_Target and Variant

THE ONLY PROBLEM I can see with the above is how one then makes it clear which Observation_Target a particular Pathogenicity entry refers to, in situations where this connection needs to be recorded. The simplest solution would be to have an association link between Observation_Target and Pathogenicity. Alternative solutions can be imagined, but they are all far more complicated and so I won't go into them here.
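
To make the proposal above easier to discuss, here is a rough Python sketch of the classes and links it describes. The class and attribute names follow the proposal where it gives them; everything else (the `target_id`, `name`, `members`, `related` and `observed_in` fields, and the use of plain lists to stand in for many-many relationships) is an assumption for illustration, not an agreed model.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ObservationTarget:
    """Superclass for anything a variant can be observed in."""
    target_id: str


@dataclass
class Individual(ObservationTarget):
    pass


@dataclass
class Panel(ObservationTarget):
    # One side of the many-many link: the same Individual object
    # may be listed as a member of several panels.
    members: List[Individual] = field(default_factory=list)


@dataclass
class Pathogenicity:
    inferential_scope: str        # required, e.g. "Individual" or "Population"
    evidence_type: str = ""       # would point at an ontology term
    evidence_text: str = ""       # free text describing the evidence
    # Optional association saying which target this statement refers to.
    target: Optional[ObservationTarget] = None


@dataclass
class Variant:
    name: str
    allelic_or_genotypic: str     # "Allelic" or "Genotypic"
    diploid_count: float          # any numeric value, not just integers
    genetic_origin: List[str] = field(default_factory=list)
    related: List["Variant"] = field(default_factory=list)  # self-recursive link
    pathogenicity: List[Pathogenicity] = field(default_factory=list)
    observed_in: List[ObservationTarget] = field(default_factory=list)
```

The optional `target` on `Pathogenicity` corresponds to the "simplest solution" mentioned above: a direct association between a pathogenicity statement and the individual or panel it is about.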

XML STRUCTURE
-------------
 
We discussed patient-centric and variant-centric use cases, and how this would require opposite hierarchies of patient and variant. Can we not, however, specify one XML schema that includes sections for both of these hierarchies? Users could then employ either (or both) according to need, and it would be simple for parsers to discriminate between the two.
 

I hope this takes us forward a little  :-)
 
Cheers
Tony
Hi All
In response to some concerns raised by Ivo by email, I am copying some further thoughts here...
The most important point to make, right away, is that the proposed changes do not actually further complicate the core model! Instead they just illuminate some inherent limitations to it, and solve these without adding extra classes. The key point involves realising that certain incompatible use cases are being forced into one model (e.g., putting genotype observations in the same place as allele observations), and so these need to be teased apart - otherwise the model will not be valuable for the community.
 
The answer is not, in my view, to run away from the model (to OWL, RDF, etc) and claim it is "too complicated" or "for nothing", but to see if it can be fixed ...and it can, and I believe I have suggested one way to achieve this. What is currently missing is a common appreciation by the group of the nature of the problems (and possible solutions). So let's please try to achieve that.
 
So let's try to clarify, point by point, based upon Ivo's questions....
 
>> - Add 'AllelicOrGenotypic' which can have values "Allelic" or
>> "Genotypic"
>> This will discriminate between these two forms of variant
>> It will also need a self-recursive relationship to capture allelic
>> variant to genotypic variant relationships
>
> What are these different two types? Do you have some examples?
>
...At a position where the mutant/normal alternatives are 'T/C', the 'T' alternative would be an 'allelic' variant. Furthermore, we have to conceptually discriminate between the 'T' as an allele in general (which will have certain features, such as frequency, and pathogenicity on average in the population), and a specific instance of the 'T' in an individual (which will have other features, such as its zygosity and its pathogenicity in that individual).
 
...At a position where the mutant/normal alternatives are 'T/C', the 'T/C' observation (or 'T/T', 'T/C', 'T/T/C' alternatives, etc) would be a 'genotypic' variant. You could argue that all such alternatives could be captured by some clever/complicated use of the allelic variant concept, but this is not true when one thinks of copy number variants - these have no equivalent to an 'allele'. Copy number observations are all genotypes (e.g., genome count of 6), and must allow for fractional counts for situations where somatic variation exists in the sampled DNA (e.g., a cancer biopsy). As for 'allelic variants' we must conceptually distinguish between general and individual specific occurrences of 'genotypic variants'.
 
...'Allelic' and 'Genotypic' variants obviously need to have a cross-relationship, and hence we need a self-recursive association for the variant class.

>> - Add 'DiploidCount' which can have any numeric value (not just
>> integers!)
>> For allelic variants, value 1 would equate to heterozygosity, value 2
>> to homozygosity, value 3 to trisomy, etc, and value 1.5 to a situation
>> in a cancer sample where half the cells have lost one allele [so we
>> would not put values like heterozygosity in the 'Origin' field].
>> For genotypic variants this same field would provide a way to capture
>> the count of copy number variant.
>> 
>> - Rename 'Origin' to 'Genetic Origin' for values such as 'Unknown',
>> 'de novo (certain)', 'de novo (inferred)', 'from mother (certain)',
>> 'from mother (inferred)', 'from father (certain)', 'from father
>> (inferred)', 'from either parent'  ...to precisely capture the genetic
>> origin [and nothing more!]
>
> What do you put in the origin field for a homozygous mutation? And how
> would you store a mutation that was inherited from the father, but is
> de novo on the chromosome that came from the mother? I'm not in favor
> of grouping homozygous mutations; they have two different sources so my
> gut tells me they should be stored as two separate mutations.
>
...This question illustrates the need for a clear distinction between allelic variants in general, allelic variants in individuals, genotypic variants in general, and genotypic variants in individuals. For your specific example, you're talking about a use case that requires patient specific data recording (rather than population level), and so there would be one variant entry, as follows;
'AllelicOrGenotypic' = "Allelic"
'DiploidCount' = 2.0
'Genetic Origin' = "de novo (certain)" and "from father (certain)"
 
...How would you do it in LOVD?
 
...And how would you do it differently for an alternative patient who was a compound heterozygote (two different recessive pathogenic mutations in the same gene, causing disease)? My solution would be to have two variant entries:
'AllelicOrGenotypic' = "Allelic"
'DiploidCount' = 1.0
'Genetic Origin' = "de novo (certain)"
'AllelicOrGenotypic' = "Allelic"
'DiploidCount' = 1.0
'Genetic Origin' = "from father (certain)"
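
The two cases above can be written down as plain data, together with a small helper that turns a 'DiploidCount' into words. This is only a sketch: the dict layout, the list used for multiple 'Genetic Origin' values, the function name `describe_diploid_count` and the wording of the labels are assumptions for illustration.

```python
def describe_diploid_count(count):
    """Translate a DiploidCount value for an allelic variant into words."""
    if count <= 0:
        raise ValueError("DiploidCount must be positive")
    # Non-integer values describe a mixture of cells, e.g. 1.5 in a tumour sample.
    if count != int(count):
        return "fractional (mixed cell population)"
    labels = {1: "heterozygous", 2: "homozygous", 3: "trisomy"}
    return labels.get(int(count), "{} copies".format(int(count)))


# Homozygous patient: one entry, two origins recorded on the same entry.
homozygous_patient = [
    {"AllelicOrGenotypic": "Allelic", "DiploidCount": 2.0,
     "Genetic Origin": ["de novo (certain)", "from father (certain)"]},
]

# Compound heterozygote: two separate entries, one origin each.
compound_heterozygote = [
    {"AllelicOrGenotypic": "Allelic", "DiploidCount": 1.0,
     "Genetic Origin": ["de novo (certain)"]},
    {"AllelicOrGenotypic": "Allelic", "DiploidCount": 1.0,
     "Genetic Origin": ["from father (certain)"]},
]
```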

>> * Have a distinct 'Pathogenicity' (rather than having this as an
>> attribute of Variant), joined to Variant by a many-many relationship.
>> It should have at least these attributes:
>> 
>> - 'Inferential Scope' to specify what set of individuals it refers to.
>> Values should include terms such as 'Individual', 'Family',
>> 'Population', 'Population XYZ', 'Ethnic group', 'Ethnic group XYZ'
>> This should also be a Required field, as it is key for interpreting
>> and integrating pathogenicity statements
>
> This really confuses me... If I understand correctly, this will lose all
> information on the variants separately? So variants are linked to a
> Pathogenicity statement that refers to an individual or even more
> abstract, a population? But, a pathogenicity class linked to an
> individual sounds like phenotype. I understand that there are issues
> with the current setup (two pathogenic variants can be non-pathogenic
> when combined, dominant/recessive, imprinting, etc), but I do believe we
> need to keep the pathogenicity info per variant as well.
> If this is how I understand it is, LOVD will not be able to generate
> these values.
>
...I am glad you realise the current system has major problems. But saying that "we need to keep the pathogenicity info per variant" reflects a lack of consideration of the above issues regarding what a 'variant' actually means. This is all completely resolved by pulling 'pathogenicity' into its own class, with an 'Inferential Scope' attribute. E.g., 
 
* allelic & genotypic variants in general
'Inferential Scope' might refer to "all populations", or "Africans", or "Ashkenazi Jews" etc, depending on what the evidence was. There may be several pieces of pathogenicity evidence, potentially based upon different ontologies, and so we'd need a separate pathogenicity class to capture all of these.
 
* allelic & genotypic variants in individuals
'Inferential Scope' might refer to "the patient", or "the patient's family", depending on what the evidence was. Again, there may be several pieces of pathogenicity evidence, potentially based upon different ontologies, and so we'd need a separate pathogenicity class to capture all of these.
 
A key point here is that pathogenicity in the individual might be 'causative' in the patient (as evidenced from linkage studies, and/or functional genomics tests on that patient's biosamples), but 'variably causative' in the population as a whole (e.g., due to genetic background differences, or different environments), and even 'non-pathogenic' in other populations. We simply must discriminate between these inferential scopes.
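
A small sketch of what this means for one variant carrying several pathogenicity statements at once. The scope and classification values are taken from the text above; the pair-based layout and the function name `group_by_scope` are assumptions made only for this example.

```python
from collections import defaultdict


def group_by_scope(statements):
    """Group (inferential scope, classification) pairs by their scope."""
    grouped = defaultdict(list)
    for scope, classification in statements:
        grouped[scope].append(classification)
    return dict(grouped)


# Three statements about the same variant that do not contradict each other,
# because each one declares the set of individuals it applies to.
statements = [
    ("Individual", "causative"),
    ("Population", "variably causative"),
    ("Population XYZ", "non-pathogenic"),
]
```

Without the scope, these three values would look like a conflict on a single 'Pathogenicity' attribute.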

>> * Have an 'Observation_Target' superclass, with subclasses
>> 'Individual' and 'Panel', with a many-many association between these
>> two. Obviously, there is also a need for a many-many relationship
>> between Observation_Target and Variant
>
...These many-many relationships are surely needed? An individual may be part of several panels (families, populations, groups affected by disease, or whatever 'panel' clustering might be needed for a certain use case), and a panel will have many individuals. Equally, each target (e.g., patient) may have many variants, and each allelic or genotypic variant in general might be found in many individuals.

> How many times will a variant, with all information associated to it,
> actually be referenced throughout several individuals? These
> many-to-many relationships in an XML file make the file quite
> complicated. I've never seen it before either; data always get repeated.
> Simple example:
>
> <entry id="001">
>   <created_by>
>     <id>1</id>
>     <name>Ivo Fokkema</name>
>   </created_by>
>   <edited_by>
>     <id>1</id>
>     <name>Ivo Fokkema</name>
>   </edited_by>
> </entry>
> <entry id="002">
>   <created_by>
>     <id>1</id>
>     <name>Ivo Fokkema</name>
>   </created_by>
>   <edited_by>
>     <id>1</id>
>     <name>Ivo Fokkema</name>
>   </edited_by>
> </entry>
>
> Yes, the user information gets repeated all the time. But introducing
> many-to-many relationships and therefore splitting data in the XML file
> will, in my opinion, make the XML document too complicated. If we're
> connecting data that way, we might as well just send JSON objects back
> and forth and drop the whole XML stuff. Well, probably nobody will agree
> with me there ;)
>
 
...I think you're making it all too complicated. There should be no need to repeat things unnecessarily. Surely the answer would be:
 
<author>
  <id>1</id>
  <name>Ivo Fokkema</name>
</author>
<entry id="001">
  <created_by>
    <id>1</id>
  </created_by>
  <edited_by>
    <id>1</id>
  </edited_by>
</entry>
<entry id="002">
  <created_by>
    <id>1</id>
  </created_by>
  <edited_by>
    <id>1</id>
  </edited_by>
</entry>
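
For what it is worth, resolving such id references on the receiving side takes only a few lines. The sketch below uses Python's standard `xml.etree.ElementTree`; the `<lsdb>` root element is hypothetical (added only so the fragment parses as one document), and the function name `resolve_authors` is likewise made up for this example.

```python
import xml.etree.ElementTree as ET

# The fragment above, wrapped in a hypothetical <lsdb> root element.
DOCUMENT = """
<lsdb>
  <author>
    <id>1</id>
    <name>Ivo Fokkema</name>
  </author>
  <entry id="001">
    <created_by><id>1</id></created_by>
    <edited_by><id>1</id></edited_by>
  </entry>
  <entry id="002">
    <created_by><id>1</id></created_by>
    <edited_by><id>1</id></edited_by>
  </entry>
</lsdb>
"""


def resolve_authors(xml_text):
    """Replace author id references in each entry with the author's name."""
    root = ET.fromstring(xml_text)
    # Build the lookup table once: author id -> author name.
    authors = {a.findtext("id"): a.findtext("name") for a in root.findall("author")}
    resolved = {}
    for entry in root.findall("entry"):
        resolved[entry.get("id")] = {
            "created_by": authors[entry.findtext("created_by/id")],
            "edited_by": authors[entry.findtext("edited_by/id")],
        }
    return resolved
```

A reference to an author id that is not defined in the file raises a `KeyError` here, which is probably what a validating importer would want anyway.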
 
...But there is one fundamental issue we do need to solve regarding the XML. For 'patient focussed' LSDBs the XML will need to cluster variants under patients. Whereas many data sharing objectives will need to cluster patients under variants (researchers will want to receive all information/patients observed for one or a few variants, not everything seen in those patients). Equally, 'population focussed' LSDBs will want to cluster info under variants in their downloads. To solve this, I suggested the XML spec allows for both alternatives - though each XML data file would, of course, only use one of the two.
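
Converting between the two clusterings is mechanical once the data is loaded, as this minimal sketch suggests. The dict layout, the patient ids and the placeholder variant names are all hypothetical; a real file would carry full patient and variant records rather than bare names.

```python
def patients_by_variant(patient_centric):
    """Invert {patient id: [variant names]} into {variant name: [patient ids]}."""
    variant_centric = {}
    for patient_id, variants in patient_centric.items():
        for variant in variants:
            # Create the list on first sight of a variant, then append the patient.
            variant_centric.setdefault(variant, []).append(patient_id)
    return variant_centric


# Patient-focussed view: variants clustered under patients.
patient_view = {
    "P1": ["variant A", "variant B"],
    "P2": ["variant A"],
}
```

Calling `patients_by_variant(patient_view)` gives the variant-focussed view of the same observations, so a database could export either hierarchy from one internal representation.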

Cheers
Tony