D7.2 Archives Established from Federated LSDBs
| Contributed by: | Acacia Reiche |
| Originally posted: | 12th August 2010: 12:20 pm |
| Last updated: | 1st July 2011: 11:52 am |
| Short URL: | http://gen2phen.org/node/25705 |
| Attachment | Size |
|---|---|
| D7.2_Archives Established from Federated LSDBs_final.pdf | 203.94 KB |
HEALTH-F4-2007-200754 www.gen2phen.org
D7.2 Archives Established from Federated LSDBs
WP7 – DATA FLOWS
V1.3 Final
Lead beneficiary: UNIMAN Date: 11/08/2010 Nature: Report Dissemination level: PU
HEALTH-200754
D7.2 – Archives Established from Federated LSDBs WP7: Data Flows Security: PU Author(s): M. Cornell (UNIMAN), A. Devereau (UNIMAN) Version: v1.3 Final
TABLE OF CONTENTS
Document Information
Document History
Definitions
1. Executive Summary
2. Introduction
3. Overview of existing LSDBs
4. Methodology for identifying LSDBs
5. Possible strategies for securing data
5.1. Hosting a copy of the database on a GEN2PHEN server
5.2. Mirror databases
5.3. Convert databases to LOVD, MUTbase or UMD format
5.4. Minimal archiving
5.5. Archiving in central repositories
6. Acknowledging the original LSDBs
7. Future Work
7.1. Exclusion of databases from archiving
7.2. Other strategies for identifying LSDBs
7.3. Repeating the archiving process
8. Summary
9. References
© Copyright 2010 GEN2PHEN Consortium
Document Information
| Field | Value |
|---|---|
| Grant Agreement Number | HEALTH-F4-2007-200754 |
| Acronym | GEN2PHEN |
| Full title | Genotype-To-Phenotype Databases: A Holistic Solution |
| Project URL | http://www.gen2phen.org |
| EU Project officer | Dr. Iiro Eerola (Iiro.EEROLA@ec.europa.eu) |
| Deliverable | Number: 7.2; Title: Archives Established from Federated LSDBs |
| Work package | Number: 7; Title: Data Flows |
| Delivery date | Contractual: Month 30; Actual: 11/08/2010 |
| Status | Version 1.3; Final |
| Nature | Report |
| Dissemination Level | Public |
| Authors (Partner) | M. Cornell, A. Devereau (UNIMAN) |
| Responsible Author | M. Cornell; Partner: UNIMAN; Email: michael.cornell@cmft.nhs.uk; Phone: +44 (0)161 2768716 |
Document History
| Name | Date | Version | Description |
|---|---|---|---|
| M. Cornell (UNIMAN) | 12/7/10 | V1.0 | Document creation |
| M. Cornell (UNIMAN) | 21/7/10 | V1.1 | Document revised after reviewers' comments |
| M. Cornell (UNIMAN) | 06/08/10 | V1.2 | Consortium Review |
| M. Cornell (UNIMAN) | 11/08/10 | V1.3 | Document revised after consortium review |
Definitions
Partners of the GEN2PHEN Consortium are referred to herein according to the following codes:
• ULEIC – University of Leicester (UK) – Coordinator
• EMBL – European Molecular Biology Laboratory (Germany) – Beneficiary
• FIMIM – Fundació IMIM (Spain) – Beneficiary
• LUMC – Leiden University Medical Center (Netherlands) – Beneficiary
• INSERM – Institut National de la Santé et de la Recherche Médicale (France) – Beneficiary
• KI – Karolinska Institutet (Sweden) – Beneficiary
• FORTH – Foundation for Research and Technology Hellas (Greece) – Beneficiary
• CEA – Commissariat à l’Energie Atomique (France) – Beneficiary
• EMC – Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
• UH.FGC – Helsingin Yliopisto (Finland) – Beneficiary
• UAVR – Universidade de Aveiro (Portugal) – Beneficiary
• UWC – University of the Western Cape (South Africa) – Beneficiary
• CSIR – Council of Scientific and Industrial Research (India) – Beneficiary
• SIB – Swiss Institute of Bioinformatics (Switzerland) – Beneficiary
• UNIMAN – The University of Manchester (UK) – Beneficiary
• BIOBASE – BioBase GmbH. (Germany) – Beneficiary
• deCODE – Islensk Erfoagreining EH (Iceland) – Beneficiary
• PHENO – Phenosystems S.A. (Belgium) – Beneficiary
• BCP – Biocomputing Platforms Ltd. Oy (Finland) – Beneficiary
• UPAT – University of Patras (Greece) – Beneficiary
Grant Agreement: The agreement signed between the beneficiaries and the European Commission for the undertaking of the GEN2PHEN project (HEALTH-200754).
Project: The sum of all activities carried out in the framework of the Grant Agreement by the Consortium.
Work plan: Schedule of tasks, deliverables, efforts, dates and responsibilities corresponding to the work to be carried out for the GEN2PHEN project, as specified in Annex I to the Grant Agreement.
Consortium: The GEN2PHEN Consortium, formed by the above-mentioned legal entities.
Consortium agreement: Agreement concluded amongst GEN2PHEN participants for the implementation of the Grant Agreement. Such an agreement shall not affect the parties’ obligations to the Community and/or to one another arising from the Grant Agreement.
1. Executive Summary
This deliverable describes the development of a strategy for identifying public locus-specific databases (LSDBs) that require archiving to ensure that they are not lost. The classification of LSDBs produced as part of D2.3 (Technical State-Of-The-Art Document for G2P Databases) was used to identify a set of 73 LSDBs which require archiving to secure the data. A policy for securing these LSDBs has been developed; the approach taken depends on the software used to build each database.
2. Introduction
There has been recent rapid growth in the number of publicly available locus-specific databases (LSDBs). In 2008, Cotton et al. noted that there were over 700 LSDBs, while a review of LSDBs by GEN2PHEN in 2009 identified 1,188. With the development of high-throughput sequencing methodologies it seems reasonable to expect that, alongside further increases in the number of LSDBs, there will be a large increase in the reporting of variants. This will certainly be the case if the recent recommendation (Greenblatt et al., 2008) that all occurrences of variants are entered in LSDBs is followed. The storage of variant data in LSDBs has the advantage that it allows curation by an expert in that gene or disease. The disadvantage, compared to large central databases, is that many databases need to be maintained and therefore the risk of data loss is greater. Data loss might occur for several reasons, including loss of funding or staff, or computer system failures. Because of the clinical importance of variant data, it is important that LSDBs are secured to prevent loss of data. The number of databases already in existence means that this will prove a major task, and there is a need for systematic methods for identifying databases in need of archiving.
3. Overview of existing LSDBs
This work builds upon the domain analysis of LSDBs previously undertaken as part of WP2 (D2.3 Technical State-Of-The-Art Document for G2P Databases, Mitropoulou et al., 2010), in which the structure and content of each LSDB was analyzed. This classification of LSDBs has allowed us to develop a systematic methodology for screening databases to identify those that might be at risk. For each database, the year of last update, the software used to build the database, the gene and the database URL were listed. In addition, each database was classified according to its attributes as listed in Table 1.
| Database attribute | Number of databases |
|---|---|
| Disclaimer | 934 |
| Collection via literature | 1183 |
| Contact curators | 1161 |
| Reference sequence | 1033 |
| Check | 1183 |
| Summary table listing all mutations | 1034 |
| HTML (no search option) | 194 |
| Complete reference list | 1039 |
| Summary phenotypic description | 882 |
| Links to references | 1123 |
| Restriction enzyme change shown | 772 |
| Explanation of content and aim | 1024 |
| Useful links | 1080 |
| Links to OMIM | 905 |
| Links to HGMD | 657 |
| Copyright | 960 |
| Online submission | 888 |
| Chromosome location | 882 |
| Information on gene | 981 |
| HGVS Nomenclature | 904 |
| Relational database | 978 |
| Use of specific DBMS | 788 |
| Ethnic group | 829 |
| Detection method | 682 |
| Querying tool(s) | 908 |
| Field: Mutation name | 795 |
| Field: Gene region | 747 |
| Field: Codon number | 737 |
| Field: Author name | 815 |
| Field: Phenotype | 818 |
| Field: Ethnic group | 709 |
| Field: Geographic location | 694 |
| Other fields for querying | 839 |
| Counter | 42 |
| Language other than English | 6 |
| Information about disease | 250 |
| List of associations | 230 |
| Protein function | 29 |
| Db description published | 266 |
| Downloadable mutation table | 125 |
| Mutation frequency | 19 |
| Flat-file database | 11 |
| Detailed phenotypic description | 95 |
| Mutation visualization tool | 159 |
| Cross-reference with other databases | 18 |
| Links to other LSDBs | 4 |
Table 1: Attributes associated with public LSDBs from the classification in D2.3. Attributes in italics have been used for identifying LSDBs requiring archiving.
Three software types are listed in the LSDB classification: LOVD (589 instances), UMD (20 instances) and MUTbase (115 instances). In addition, there are many LSDBs (454 instances) for which the software used to create the database is unknown (i.e. it could not be ascertained from the LSDB website). Some of the LSDBs classified as unknown will not actually be databases in the sense that they do not have a DBMS (database management system). For example, the variant data may be listed within a table in the webpage, and updating the database might mean adding more rows to the HTML table.
4. Methodology for identifying LSDBs
In order to decide which databases should be secured, we proposed the following list of questions:
1. How long ago was the database last updated? If a database is not being updated, this might indicate that it is no longer being maintained. A total of 286 LSDBs (25% of all LSDBs) had not been updated in 2008 or 2009. However, it should be borne in mind that for some genes variants are reported less frequently, which may explain why an LSDB has not recently been updated.
2. Is the LSDB likely to be of clinical importance? If an LSDB is used as part of the analysis of genetic test results, then loss of the database may have an impact on patients. The UK Genetic Testing Network website (http://www.ukgtn.nhs.uk/gtn/Home) lists the genes for which there are genetic tests within the UK. This list was used to determine whether an LSDB was likely to be of clinical importance.
3. Are there other LSDBs for this gene? If there are multiple LSDBs for the same gene, then there is an alternative source of information about variants.
4. Are the variants named using the correct HGVS nomenclature? If the correct nomenclature (den Dunnen and Antonarakis 2001) has not been used, the database may not be as easy to search and would therefore be less useful.
5. Is the reference sequence used to name the variants given? This is of particular importance if changes have been made to the sequence which would affect the variant names.
Based upon these questions, the decision tree shown in Figure 1 was developed. The LSDB classification from D2.3 was converted to tab-delimited text format and a simple Java program was used to determine whether each LSDB should be archived according to the criteria shown in the decision tree. A total of 73 LSDBs were identified as requiring archiving. These are listed in Table 2.
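The screening step can be illustrated with a short Java sketch. It is not the original program, and the branching of Figure 1 is not reproduced in this text, so the rule below is an assumption: an LSDB is selected when it was last updated before 2008, uses HGVS nomenclature and lists its reference sequence. The UKGTN flag and the number of LSDBs per gene are left out of the filter because Table 2 contains rows with either value. The class name, the column order of the tab-delimited input and the demo rows are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical re-creation of the screening step; not the original program. */
final class LsdbArchiveScreen {

    /** One row of an assumed tab-delimited export of the D2.3 classification. */
    static final class Lsdb {
        final String id;
        final String gene;
        final int lastUpdateYear;
        final boolean usesHgvs;
        final boolean listsReferenceSequence;

        Lsdb(String id, String gene, int lastUpdateYear,
             boolean usesHgvs, boolean listsReferenceSequence) {
            this.id = id;
            this.gene = gene;
            this.lastUpdateYear = lastUpdateYear;
            this.usesHgvs = usesHgvs;
            this.listsReferenceSequence = listsReferenceSequence;
        }
    }

    /** Assumed column order: ID, gene, last update year, HGVS flag, reference sequence flag. */
    static Lsdb parseLine(String line) {
        String[] f = line.split("\t", -1);
        if (f.length < 5) {
            throw new IllegalArgumentException("expected 5 columns: " + line);
        }
        return new Lsdb(f[0].trim(), f[1].trim(), Integer.parseInt(f[2].trim()),
                Boolean.parseBoolean(f[3].trim()), Boolean.parseBoolean(f[4].trim()));
    }

    /** Assumed rule: not updated in 2008 or 2009, HGVS names used, reference sequence given. */
    static boolean needsArchiving(Lsdb db) {
        boolean stale = db.lastUpdateYear < 2008;
        return stale && db.usesHgvs && db.listsReferenceSequence;
    }

    /** Returns the IDs of all rows that satisfy the assumed rule, in input order. */
    static List<String> screen(List<String> lines) {
        List<String> ids = new ArrayList<>();
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines in the export
            }
            Lsdb db = parseLine(line);
            if (needsArchiving(db)) {
                ids.add(db.id);
            }
        }
        return ids;
    }

    public static void main(String[] args) {
        // Made-up rows for illustration only; they do not describe real LSDBs.
        List<String> rows = Arrays.asList(
                "demo1\tGENE1\t2004\tTRUE\tTRUE",
                "demo2\tGENE2\t2009\tTRUE\tTRUE",
                "demo3\tGENE3\t2003\tFALSE\tTRUE");
        System.out.println(screen(rows)); // prints [demo1]
    }
}
```

Keeping the rule in a single method makes it simple to swap in a different reading of the decision tree without touching the parsing code.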
Figure 1: Decision tree used to identify LSDBs for archiving
Database ID ld11 ld16 ld36 ld46 ld72 ld86 ld88 ld102 ld145 ld223 ld224 ld225 ld226 ld227 ld228 ld229 ld230 ld231 ld232 ld241 ld243 ld285 ld356 ld375 ld376 ld377 ld390 ld403 ld422 ld473 ld474 ld611
Gene ACTC1 ADSL AP3B1 AQP2 ASS1 AVP AVPR2 BEST1 CASR CRYAA CRYAB CRYBA1 CRYBA4 CRYBB1 CRYBB2 CRYBB3 CRYGC CRYGD CRYGS CTSC CXCR4 DMD FBN2 FOXL2 FOXL2 FOXN1 G6PD GFI1 GM2A HEXA HEXB L1CAM
URL http://www.angis.org.au/Databases/Heart/heartbreak.html http://www.icp.ucl.ac.be/adsldb/ http://bioinf.uta.fi/AP3B1base/ http://www.medicine.mcgill.ca/nephros/ http://chromium.liacs.nl/LOVD2/home.php?select_db=ASS1 http://www.medicine.mcgill.ca/nephros/ http://www.medicine.mcgill.ca/nephros/ http://www-huge.uni-regensburg.de/VMD2_database/ http://www.casrdb.mcgill.ca http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYAA http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYAB http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYBA1 http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYBA4 http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYBB1 http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYBB2 http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYBB3 http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYGC http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYGD http://grenada.lumc.nl/LOVD2/eye/home.php?select_db=CRYGS http://bioinf.uta.fi/CTSCbase/ http://bioinf.uta.fi/CXCR4base/ http://www.umd.be/DMD/ http://www.umd.be/FBN2/ http://medgen.ugent.be/LOVD2/home.php?select_db=FOXL2 http://medgen.ugent.be/foxl2/ http://bioinf.uta.fi/FOXN1base/ http://www.bioinf.org.uk/g6pd/ http://bioinf.uta.fi/GFI1base/ http://www.hexdb.mcgill.ca http://www.hexdb.mcgill.ca http://www.hexdb.mcgill.ca http://www.rug.nl/umcg/faculteit/disciplinegroepen/medischegenetica/hereditarydi...
Database software Unknown Unknown MUTbase Unknown LOVD Unknown Unknown Unknown Unknown LOVD LOVD LOVD LOVD LOVD LOVD LOVD LOVD LOVD LOVD MUTbase MUTbase UMD UMD LOVD Unknown MUTbase Unknown MUTbase Unknown Unknown Unknown Unknown
Last Update 2004 2001 2007 2003 2007 2003 2003 2007 2003 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2007 2003 2003 2003 2006
UKGTN gene FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE TRUE FALSE TRUE FALSE FALSE FALSE FALSE TRUE
Number of LSDBs for this gene 2 1 3 1 2 2 2 3 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 3 3 1 1 1 1 1 1 3
Database ID (continued) ld673 ld688 ld711 ld713 ld716 ld724 ld783 ld816 ld817 ld818 ld819 ld822 ld821 ld823 ld825 ld906 ld915 ld916 ld917 ld935 ld998 ld1022 ld1039 ld1040 ld1050 ld1052 ld1053 ld1066 ld1067 ld1073 ld1087 ld1091 ld1093 ld1102 ld1117 ld1121 ld1133 ld1137 ld1138 ld1151 ld1167
Gene (continued) MECP2 MLYCD MYBPC3 MYH7 MYL2 MYO7A OTC PEX13 PEX14 PEX16 PEX19 PEX2 PEX26 PEX3 PEX6 RAC2 RFX5 RFXANK RFXAP RPGR SLC35C1 SP110 STAT5B STX11 TAP1 TAP2 TAPBP TCN2 TCOF1 TGFBR2 TNFRSF13B TNNI3 TNNT2 TPM1 TTN TYK2 UNG USH1C USH1G WAS ZAP70
URL (continued) http://mecp2.chw.edu.au/ http://lsdb.hgu.mrc.ac.uk/home.php?select_db=MLYCD http://www.angis.org.au/Databases/Heart/heartbreak.html http://www.angis.org.au/Databases/Heart/heartbreak.html http://www.angis.org.au/Databases/Heart/heartbreak.html http://www.umd.be/MYO7A/ http://ureacycle.cnmcresearch.org/otc/ http://www.dbpex.org/home.php?select_db=PEX13 http://www.dbpex.org/home.php?select_db=PEX14 http://www.dbpex.org/home.php?select_db=PEX16 http://www.dbpex.org/home.php?select_db=PEX19 http://www.dbpex.org/home.php?select_db=PEX2 http://www.dbpex.org/home.php?select_db=PEX26 http://www.dbpex.org/home.php?select_db=PEX3 http://www.dbpex.org/home.php?select_db=PEX6 http://bioinf.uta.fi/RAC2base/ http://bioinf.uta.fi/RFX5base/ http://bioinf.uta.fi/RFXANKbase/ http://bioinf.uta.fi/RFXAPbase/ http://www.retina-international.org/sci-news/rpgrmut.htm http://bioinf.uta.fi/SLC35C1base/ http://bioinf.uta.fi/SP110base/ http://bioinf.uta.fi/STAT5Bbase/ http://bioinf.uta.fi/STX11base/ http://bioinf.uta.fi/TAP1base/ http://bioinf.uta.fi/TAP2base/ http://bioinf.uta.fi/TAPBPbase/ http://bioinf.uta.fi/TCN2base/ http://genoma.ib.usp.br/TCOF1_database/index.php http://www.umd.be/TGFBR2/ http://bioinf.uta.fi/TNFRSF13Bbase/ http://www.angis.org.au/Databases/Heart/heartbreak.html http://www.angis.org.au/Databases/Heart/heartbreak.html http://www.angis.org.au/Databases/Heart/heartbreak.html http://www.angis.org.au/Databases/Heart/heartbreak.html http://bioinf.uta.fi/TYK2base/ http://bioinf.uta.fi/UNGbase/ http://www.umd.be/USH1C/ http://www.umd.be/USH1G/ http://bioinf.uta.fi/WASbase/ http://bioinf.uta.fi/ZAP70base/index.php
Database software (continued) Unknown LOVD Unknown Unknown Unknown UMD Unknown LOVD LOVD LOVD LOVD LOVD LOVD LOVD LOVD MUTbase MUTbase MUTbase MUTbase Unknown MUTbase MUTbase MUTbase MUTbase MUTbase MUTbase MUTbase MUTbase Unknown UMD MUTbase Unknown Unknown Unknown Unknown MUTbase MUTbase UMD UMD MUTbase MUTbase
Last Update (continued) 2001 2007 2004 2004 2004 2007 2007 2006 2006 2006 2007 2007 2007 2006 2007 2007 2007 2007 2007 1999 2007 2007 2007 2007 2007 2007 2007 2007 2004 2007 2007 2004 2004 2004 2004 2007 2007 2007 2007 2004 2007
UKGTN gene (continued) TRUE FALSE TRUE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
Number of LSDBs for this gene (continued) 1 2 2 2 2 3 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 2 1 2 2 1 1 1 1 1 2 2 2 2 1 1 2 1 1 1
Table 2: LSDBs identified as requiring archiving using the decision tree in Figure 1.
5. Possible strategies for securing data
The following five strategies could be used for archiving and securing LSDBs.
5.1. Hosting a copy of the database on a GEN2PHEN server.
This solution would allow users to access the database in the event of the original database being lost. In practice a copy is made of the original database and hosted on a GEN2PHEN server. There is no further link between the original database and the copy. This might be an appropriate solution if we have identified that the database is significantly out-of-date and is not likely to be updated. In such cases there is little danger of our database being out of sync with the original. However there are still several issues to be considered:
• Do we know which DBMS the original database is using?
• Is the DBMS compatible with software in use?
• Is the database running on an old version of the DBMS? If so, how much work will be required to get it to run on an up-to-date version?
• What software has been used to provide web access to the original database?
5.2. Mirror databases.
Mirroring databases on other servers would prevent data loss and would also provide backup in instances where a server fails. As with 5.1 this would require detailed knowledge of the DBMS used for the original database. In addition, since LSDBs which we would consider archiving are, by definition, not being updated it’s not clear whether this strategy would provide any advantages over 5.1.
5.3. Convert databases to LOVD, MUTbase or UMD format.
This option provides an additional benefit. By converting data into LOVD, MUTbase or UMD formats the data will not only be secured, it will also allow the data to be integrated with other variant data using software being developed by GEN2PHEN. However, again there are issues that need to be considered.
• How easy will it be to map tables in the database to LOVD or UMD models?
• Will each conversion of a database into LOVD or UMD format require a new set of programs to map the data into the new formats?
5.4. Minimal Archiving
Data is archived so that while it is not publicly available, it can be kept secure in case the worst happens and the original database is lost. This is the minimum level of securing data. Because data will not be made public, unless the original data source is lost, there is no requirement to obtain the permission of the original database owners.
5.5. Archiving in central repositories
As an alternative to maintaining a new version of an existing LSDB, the variant data could be submitted to a central repository, such as Ensembl or dbSNP. Mechanisms for submitting variant data to these repositories are in the process of being established.
Of the 73 LSDBs shown in Table 2, 21 use LOVD software, 22 MUTbase, 6 UMD and 24 use unknown database software. LOVD, MUTbase and UMD software have been developed by GEN2PHEN partners and therefore the strategy described in 5.1 may be appropriate. For those LSDBs which use unknown database software, this option does not appear viable. For some of these databases there may be no DBMS. For example, the TCOF1 database (http://genoma.ib.usp.br/TCOF1_database/index.php) does not appear to have a DBMS; the data is written directly in the HTML. In these instances, the suggested strategy is:
1. Perform minimal archiving by “screen scraping” the data. At present this data is secured at the University of Manchester.
2. Convert the minimal archived data to LOVD format, with permission from the original data owners.
3. Once the data has been converted to LOVD format, the database will then be secured under the same policies used for other LOVD instances maintained by LUMC.
For most of these databases an empty LOVD database already exists, created as part of deliverable 4.4. Table 3 lists the 24 “unknown software” LSDBs and the action taken. Three of the databases appear to be using LOVD software, although one of these (ld376) cannot be accessed. In addition, the FHC Mutation Database entries (ld11, ld711, ld713, ld716, ld1091, ld1093, ld1102 and ld1117) could not be found.
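Step 1 above, minimal archiving by screen scraping, can be sketched in Java for the case where the variants sit in a plain HTML table. This is only an illustration under stated assumptions: the class name, the sample markup and the tab-separated output format are hypothetical, and regular expressions are used purely to keep the example dependency-free. They will not cope with nested tables or malformed markup, for which a proper HTML parser would be the safer choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical screen-scraping sketch: pulls cell text out of simple HTML tables. */
final class HtmlTableScraper {

    // Matches one table row and captures its inner markup.
    private static final Pattern ROW = Pattern.compile(
            "<tr(?:\\s[^>]*)?>(.*?)</tr>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
    // Matches one header or data cell and captures its inner markup.
    private static final Pattern CELL = Pattern.compile(
            "<t[dh](?:\\s[^>]*)?>(.*?)</t[dh]>", Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    /** Returns the text of every cell, grouped by table row, in document order. */
    static List<List<String>> extractRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        Matcher row = ROW.matcher(html);
        while (row.find()) {
            List<String> cells = new ArrayList<>();
            Matcher cell = CELL.matcher(row.group(1));
            while (cell.find()) {
                cells.add(cleanText(cell.group(1)));
            }
            if (!cells.isEmpty()) {
                rows.add(cells);
            }
        }
        return rows;
    }

    /** Strips nested tags, decodes three common entities and normalises whitespace. */
    private static String cleanText(String markup) {
        return markup.replaceAll("<[^>]+>", "")
                .replace("&gt;", ">")
                .replace("&lt;", "<")
                .replace("&amp;", "&")
                .replaceAll("\\s+", " ")
                .trim();
    }

    /** Serialises the rows as tab-separated text, one line per table row. */
    static String toTsv(List<List<String>> rows) {
        StringBuilder sb = new StringBuilder();
        for (List<String> r : rows) {
            sb.append(String.join("\t", r)).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up markup standing in for a page that lists variants in an HTML table.
        String html = "<table><tr><th>Mutation</th><th>Exon</th></tr>"
                + "<tr><td><b>c.76A&gt;T</b></td><td> 2 </td></tr></table>";
        System.out.print(toTsv(extractRows(html)));
    }
}
```

Writing the scraped rows as tab-separated text keeps the archive independent of any DBMS, which suits databases where none could be identified.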
| ID | Gene | URL | LOVD exists | Action |
|---|---|---|---|---|
| ld11 | ACTC1 | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes | 404 not found |
| ld16 | ADSL | http://www.icp.ucl.ac.be/adsldb/ | Yes (empty) | minimal archiving completed |
| ld46 | AQP2 | http://www.medicine.mcgill.ca/nephros/ | Yes (empty) | minimal archiving completed |
| ld86 | AVP | http://www.medicine.mcgill.ca/nephros/ | Yes (empty) | minimal archiving completed |
| ld88 | AVPR2 | http://www.medicine.mcgill.ca/nephros/ | Yes | minimal archiving completed |
| ld102 | BEST1 | http://www-huge.uni-regensburg.de/VMD2_database/ | Yes (empty) | LOVD database; minimal archiving completed |
| ld145 | CASR | http://www.casrdb.mcgill.ca | Yes (empty) | minimal archiving completed |
| ld376 | FOXL2 | http://medgen.ugent.be/foxl2/ | Yes (empty) | LOVD but can't access (500 internal server error) |
| ld390 | G6PD | http://www.bioinf.org.uk/g6pd/ | Yes | minimal archiving completed |
| ld422 | GM2A | http://www.hexdb.mcgill.ca | Yes (empty) | minimal archiving completed |
| ld473 | HEXA | http://www.hexdb.mcgill.ca | Yes (empty) | minimal archiving completed |
| ld474 | HEXB | http://www.hexdb.mcgill.ca | Yes (empty) | minimal archiving completed |
| ld611 | L1CAM | http://www.rug.nl/umcg/faculteit/disciplinegroepen/medischegenetica/hereditarydiseases/l1cam/index | Yes (empty) | appears to now be LOVD and is being updated. No action necessary |
| ld673 | MECP2 | http://mecp2.chw.edu.au/ | Yes | minimal archiving completed |
| ld711 | MYBPC3 | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes (empty) | 404 not found |
| ld713 | MYH7 | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes (empty) | 404 not found |
| ld716 | MYL2 | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes (empty) | 404 not found |
| ld783 | OTC | http://ureacycle.cnmcresearch.org/otc/ | Yes | minimal archiving completed |
| ld935 | RPGR | http://www.retina-international.org/sci-news/rpgrmut.htm | Yes | minimal archiving completed; variants don’t appear to use correct HGVS nomenclature |
| ld1067 | TCOF1 | http://genoma.ib.usp.br/TCOF1_database/index.php | Yes (empty) | minimal archiving completed |
| ld1091 | TNNI3 | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes (empty) | 404 not found |
| ld1093 | TNNT2 | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes (empty) | 404 not found |
| ld1102 | TPM1 | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes (empty) | 404 not found |
| ld1117 | TTN | http://www.angis.org.au/Databases/Heart/heartbreak.html | Yes | 404 not found |
Table 3: Databases using unknown software
6. Acknowledging the original LSDBs.
If an LSDB is archived by creating a LOVD database that replicates its data, it is important that we give credit to the curators of the original database. Therefore:
• We will seek permission from the database curators before making the LOVD version publicly available.
• If permission is not forthcoming, we will create an archive of the database but will not make it public via LOVD unless the original database becomes unavailable.
• We will invite the curators of the original database to be curators of the new database.
• We will provide a link to the original database (if still available).
7. Future Work
7.1. Exclusion of databases from archiving.
As discussed in section 4, the methodology that has been developed for identifying databases for archiving excludes databases which do not list reference sequences or do not use correct HGVS nomenclature. As shown in Table 4, the effect of this policy is to exclude most of the “unknown” databases. This is clearly not an ideal situation, since it is the unknowns which tend not to be updated and which appear most in danger of being lost.

| Software | LSDBs | Not HGVS | No Reference Sequence | Total Excluded |
|---|---|---|---|---|
| LOVD | 589 | 1 | 0 | 1 |
| UMD | 20 | 0 | 10 | 10 |
| MUTbase | 115 | 0 | 0 | 0 |
| Unknown | 454 | 279 | 141 | 342 |

Table 4: Numbers of LSDBs excluded from archiving because they do not follow HGVS nomenclature and/or do not list the reference sequence used.
The problem of missing reference sequences may be solved by software being developed at NGRL Manchester. This allows a potential reference sequence to be tested against variants to see whether it could have been the reference sequence used to name those variants. In addition, the Mutalyzer Batch Checker (http://www.mutalyzer.nl/1.0.4) developed by LUMC allows the user to check all HGVS descriptions in combination with all potential reference sequences.
7.2. Other strategies for identifying LSDBs.
The strategy adopted here to identify LSDBs for archiving used comparatively few of the database attributes identified in the D2.3 classification. It was chosen for practicality and has allowed us to identify a set of LSDBs that require archiving and to develop a set of policies for securing them. After completion, we may wish to extend this and develop alternative strategies. For example, it might be desirable to ensure that databases with ethnicity data are archived. Because of the systematic way in which the classification has been produced, alternative strategies would be easy to develop.
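Returning to the missing reference sequence problem in 7.1, the idea of testing a candidate reference sequence against a set of variants can be sketched as follows. This is not the NGRL software or Mutalyzer, and the details are assumptions: the sketch handles only simple coding substitutions of the form `c.76A>T`, takes position 1 to be the first base of the supplied coding sequence, and counts how many variants name a reference base that the candidate actually carries. The class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch: does a candidate coding sequence agree with named substitutions? */
final class ReferenceSequenceCheck {

    // Simple coding substitution, e.g. c.76A>T: position, reference base, new base.
    private static final Pattern SUBSTITUTION =
            Pattern.compile("c\\.(\\d+)([ACGT])>([ACGT])");

    /** True if the candidate has the variant's stated reference base at that position. */
    static boolean isConsistent(String codingSequence, String variant) {
        Matcher m = SUBSTITUTION.matcher(variant);
        if (!m.matches()) {
            throw new IllegalArgumentException("unsupported variant description: " + variant);
        }
        int position = Integer.parseInt(m.group(1)); // 1-based, per the assumption above
        if (position < 1 || position > codingSequence.length()) {
            return false; // the candidate is too short to contain this position
        }
        char actual = Character.toUpperCase(codingSequence.charAt(position - 1));
        return actual == m.group(2).charAt(0);
    }

    /** Counts the variants whose reference base matches the candidate sequence. */
    static int countConsistent(String codingSequence, List<String> variants) {
        int matches = 0;
        for (String v : variants) {
            if (isConsistent(codingSequence, v)) {
                matches++;
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up six-base sequence and variants, for illustration only.
        String candidate = "ATGGCC";
        List<String> variants = Arrays.asList("c.1A>G", "c.4G>T", "c.2A>C");
        System.out.println(countConsistent(candidate, variants) + " of "
                + variants.size() + " variants fit the candidate");
    }
}
```

A candidate that fits every variant is not proven to be the sequence the curators used, but one that fits few of them can be ruled out.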
7.3. Repeating the archiving process.
Clearly, the archiving process should not be a one-off. LSDBs that have not been identified as in danger in this review might be in danger next year. It is intended that the manual review of LSDBs will be repeated, perhaps yearly. If so, then the process of identifying LSDBs that are at risk and in need of archiving should be repeated. In addition, LUMC has developed an automated system for surveying LOVD installations which may identify those in need of archiving.
8. Summary
• Because of the growing number of LSDBs and the increasing volumes of data being generated, the archiving of LSDBs represents a huge challenge and it is important that systematic methods for identifying “at risk” LSDBs are developed. In this deliverable we have presented the first effort at developing such a systematic methodology.
• We have been able to develop this method because of previous efforts by GEN2PHEN members in developing the classification of LSDBs. This deliverable demonstrates the importance of developing the LSDB classification and underlines the need for the classification to continue.
• In the period between the classification of LSDBs (D2.3) and this piece of work, several of the “at risk” LSDBs appear to have been lost. This loss may not be permanent, as the databases may just have been moved, but for anyone who used and relied on these databases the effect is the same. This underlines the need for data archiving strategies.
• Some of this data has already been saved. The data from the FHC Mutation Database (http://www.angis.org.au/Databases/Heart/heartbreak.html), which gave a 404 “not found” error, have been secured by LUMC and are currently being added to the Leiden Muscular Dystrophy pages (genes ACTC1, MYBPC3, MYH7, MYL2, TNNI3, TNNT2, TPM1, TTN).
• This deliverable demonstrates the need for data standards. At present, databases which do not name variants correctly or do not list reference sequences are excluded from archiving. Including these LSDBs will take considerable effort. It is clear that LSDBs developed using the LOVD and MUTbase “in a box” database solutions do not suffer from these problems. The development of this software appears to have helped to establish good practice.
9. References
Cotton RGH et al. 2008. Recommendations for locus-specific databases and their curation. Human Mutation 29:2–5.
den Dunnen JT, Antonarakis SE. 2001. Nomenclature for the description of human sequence variations. Hum Genet 109:121–124.
Greenblatt MS et al. 2008. Locus-specific databases and recommendations to strengthen their contribution to the classification of variants in cancer susceptibility genes. Human Mutation 29:1273–1281.
Mitropoulou C, Webb AJ, Mitropoulos K, Brookes AJ, Patrinos GP. 2010. Locus-specific database domain and data content analysis: evolution and content maturation towards clinical use. Human Mutation (accepted for publication).
This document is © 2010 by acaciareiche - all rights reserved.
