D1.5 Intermediate Report from Project Assessment Pilot
| Contributed by: | Acacia Reiche |
| Originally posted: | 16th March 2011: 4:30 pm |
| Last updated: | 1st July 2011: 11:52 am |
| Short URL: | http://gen2phen.org/node/36500 |
| Attachment | Size |
|---|---|
| D1.5 Intermediate Report from Project Assessment Pilot_v1.2_final.pdf | 1.63 MB |
HEALTH-F4-2007-200754 www.gen2phen.org
D1.5 Intermediate Report from Project Assessment Pilot
WP1 – Scientific Coordination
V1.2 Final
Lead beneficiary: ULEIC Date: 23/02/2011 Nature: Report Dissemination level: Public
D1.5 Intermediate Report from Project Assessment Pilot
WP1: Scientific Coordination
HEALTH-200754
Author(s): Michael Cornell (UNIMAN)
Security: PU Version: v1.2 –
Final
TABLE OF CONTENTS
DOCUMENT INFORMATION
DOCUMENT HISTORY
DEFINITIONS
1 INTRODUCTION
2 ANALYSIS OF VARIANTS BY CLINICAL SCIENTISTS
  2.1 LOCUS-SPECIFIC DATABASES
  2.2 TESTING MATCHED CONTROLS
  2.3 CO-OCCURRENCE IN TRANS WITH KNOWN DELETERIOUS MUTATIONS
  2.4 CO-SEGREGATION WITH THE DISEASE IN THE FAMILY
  2.5 OCCURRENCE OF A NEW VARIANT CONCURRENT WITH THE (SPORADIC) INCIDENCE OF THE DISEASE
  2.6 IN SILICO PREDICTIONS
  2.7 RNA STUDIES
  2.8 FUNCTIONAL STUDIES
  2.9 LOSS OF HETEROZYGOSITY
  2.10 PRESENCE OR ABSENCE IN SNP DATABASES
  2.11 INTEGRATING LINES OF EVIDENCE
  2.12 CLASSIFYING VARIANTS
  2.13 REPORTING VARIANTS
3 BIOINFORMATICS RESOURCES FOR CLINICAL SCIENTISTS
  3.1 LOCUS SPECIFIC DATABASES
  3.2 DATABASES FOR THE ANALYSIS OF BRCA1 AND BRCA2 VARIANTS
    3.2.1 UMD databases
    3.2.2 LOVD databases
    3.2.3 Breast Cancer Information Core (BIC) database
    3.2.4 Diagnostic Mutation Database (DMuDB)
    3.2.5 Human Gene Mutation Database (HGMD)
    3.2.6 Single Nucleotide Polymorphism Database (dbSNP)
  3.3 BIOINFORMATICS TOOLS
  ISSUES
4 EXAMPLES OF VARIANT ANALYSIS BY CLINICAL SCIENTISTS
  4.1 BRCA1 U14680.1 c.211A>G
    4.1.1 In silico predictions
    4.1.2 Database searches
    4.1.3 Publications
    4.1.4 Classification
  4.2 BRCA1 U14680.1 c.1067A>G
    4.2.1 In silico predictions
    4.2.2 Databases
    4.2.3 Publications
    4.2.4 Conclusion
  4.3 ISSUES
5 THE ENIGMA CONSORTIUM
6 ROLE OF NEXT GENERATION SEQUENCING IN CLINICAL SEQUENCING
  6.1 ISSUES
7 GENOME-WIDE ASSOCIATION STUDIES
8 USE OF GEN2PHEN DELIVERABLES
  8.1 MUTALYZER
  8.2 WAVE (WEB ANALYSIS OF THE VARIOME)
  8.3 HGVBASEG2P
    8.3.1 Searching HGVbaseG2P using a SNP
    8.3.2 Searching HGVbaseG2P using a region
    8.3.3 Searching HGVbaseG2P using a gene name
  8.4 GEN2PHEN KNOWLEDGE CENTRE
    8.4.1 Locating GEN2PHEN resources in the Knowledge Centre
    8.4.2 Using GEN2PHEN data via the Knowledge Centre
  8.5 OBTAINING FEEDBACK FROM THE CLINICAL SCIENCE COMMUNITY
9 GENERATING LRGS FOR BRCA1 AND BRCA2
  9.1 AVAILABLE BRCA1 AND BRCA2 LSDBS AND REFERENCE SEQUENCES
  9.2 COMPARISON OF REFERENCE SEQUENCES
  9.3 ALTERNATIVE SPLICING
  9.5 ANNOTATION OF TRANSCRIPTS
  9.6 OVERVIEW OF BRCA1 AND BRCA2 REFERENCE SEQUENCES
  9.7 SURVEY OF REFERENCE SEQUENCE USERS
  9.8 DESIGN OF BRCA1 AND BRCA2 LRGS
10 SUMMARY
  10.1 BARRIERS TO DATA INTEGRATION AND GEN2PHEN SOLUTIONS
11 REFERENCES
© Copyright 2011 GEN2PHEN Consortium
Document Information
| Grant Agreement Number | HEALTH-F4-2007-200754 |
| Acronym | GEN2PHEN |
| Full title | Genotype-To-Phenotype Databases: A Holistic Solution |
| Project URL | http://www.gen2phen.org |
| EU Project officer | Dr. Iiro Eerola (Iiro.EEROLA@ec.europa.eu) |
| Deliverable | Number: 1.5; Title: Intermediate Report from Project Assessment Pilot |
| Work package | Number: 1; Title: Scientific Coordination |
| Delivery date | Contractual: Month 30; Actual: 23/02/2011 |
| Status | Version 1.2, Final |
| Nature | Report |
| Dissemination Level | Public |
| Authors (Partner) | UNIMAN |
| Responsible Author | Michael Cornell; Partner: UNIMAN; Email: michael.cornell@cmft.nhs.uk; Phone: 0044 (0)161 276 8716 |
Document History
| Name | Date | Version | Description |
|---|---|---|---|
| Michael Cornell | 07/02/2011 | 1.0 | First Draft |
| Michael Cornell | 20/02/2011 | 1.1 | Amended following comments from reviewers |
| Michael Cornell | 23/02/11 | 1.2 | Amended following comments from consortium |
Definitions
Partners of the GEN2PHEN Consortium are referred to herein according to the following codes:
ULEIC – University of Leicester (UK) – Coordinator
EMBL – European Molecular Biology Laboratory (Germany) – Beneficiary
FIMIM – Fundació IMIM (Spain) – Beneficiary
LUMC – Leiden University Medical Centre (Netherlands) – Beneficiary
INSERM – Institut National de la Santé et de la Recherche Médicale (France) – Beneficiary
KI – Karolinska Institutet (Sweden) – Beneficiary
FORTH – Foundation for Research and Technology Hellas (Greece) – Beneficiary
CEA – Commissariat à l’Energie Atomique (France) – Beneficiary
EMC – Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
UH.FGC – Helsingin Yliopisto (Finland) – Beneficiary
UAVR – Universidade de Aveiro (Portugal) – Beneficiary
UWC – University of the Western Cape (South Africa) – Beneficiary
CSIR – Council of Scientific and Industrial Research (India) – Beneficiary
SIB – Swiss Institute of Bioinformatics (Switzerland) – Beneficiary
UNIMAN – The University of Manchester (UK) – Beneficiary
BIOBASE – BioBase GmbH. (Germany) – Beneficiary
deCODE – Islensk Erfoagreining EH (Iceland) – Beneficiary
PHENO – Phenosystems S.A. (Belgium) – Beneficiary
BCP – Biocomputing Platforms Ltd. Oy (Finland) – Beneficiary
UPAT – University of Patras (Greece) – Beneficiary
Grant Agreement: The agreement signed between the beneficiaries and the European Commission for the undertaking of the GEN2PHEN project (HEALTH-200754).
Project: The sum of all activities carried out in the framework of the Grant Agreement by the Consortium.
Work plan: Schedule of tasks, deliverables, efforts, dates and responsibilities corresponding to the work to be carried out for the GEN2PHEN project, as specified in Annex I to the Grant Agreement.
Consortium: The GEN2PHEN Consortium, formed by the above-mentioned legal entities.
Consortium agreement: Agreement concluded amongst GEN2PHEN participants for the implementation of the Grant Agreement. Such an agreement shall not affect the parties’ obligations to the Community and/or to one another arising from the Grant Agreement.
1 Introduction
The aim of the second Pilot Study is to assess GEN2PHEN software from the perspective of an external user. The study focuses particularly on diagnostic laboratory data users, whose main interest is in determining the pathogenicity of sequence changes. In addition, we consider research data users who are developing tests to identify disease-causing genetic variations, and members of the ENIGMA consortium (http://enigmaconsortium.org/), a worldwide group which is accumulating evidence about variants of uncertain significance with the aim of classifying their involvement in predisposition to breast and ovarian cancer. In the main, this pilot focuses on variants in the BRCA1 and BRCA2 genes, which are associated with breast and ovarian cancer. The ways in which these potential GEN2PHEN users generate and analyse variant data are discussed, and the potential impact of GEN2PHEN outputs on these processes is considered.
When considering the potential impact of software it is important to note that the ways in which genetic testing is carried out could change considerably in the near future. The incorporation of next generation sequencing (NGS) technologies into diagnostic sequencing is only just beginning to be piloted. However, it is clear that it will be possible to sequence more genes, much faster than traditional Sanger sequencing techniques allow. This, combined with the falling cost of NGS, will mean that far more clinical sequence data will be generated. We therefore need to ensure that the technology developed by GEN2PHEN will be able to cope with next generation diagnostic sequencing.
2 Analysis of variants by clinical scientists
The role of the clinical scientist in a diagnostic laboratory is to perform genetic tests, such as DNA sequencing or MLPA (multiplex ligation-dependent probe amplification), and provide analysis of any variants identified by the test. The tests performed by clinical scientists can be diagnostic or predictive. In the case of diagnostic testing, the individual tested has a phenotype, such as breast/ovarian cancer, and the test is carried out in order to try to determine any underlying genetic cause. In some cases, for example BRCA1 and BRCA2 sequencing, testing of family members may follow to determine whether they have the same genotype. In contrast, predictive testing might determine the likelihood of an individual going on to develop a phenotype. Predictive testing is used for pre-implantation genetic diagnosis (PGD) of chromosomal abnormalities such as trisomy 21. The recent development of tests based on next generation sequencing of fetal DNA in maternal blood (Chiu et al., 2011) may lead to many more PGD tests being developed (Greely, 2011).
The process of analysing variants involves assessing multiple lines of evidence in order to produce an overall assessment of each variant’s pathogenicity. The guidelines developed by the UK Clinical Molecular Genetics Society (CMGS) and the Dutch Society of Clinical Genetic Laboratory Specialists for the interpretation and reporting of unclassified variants (the UV guidelines, see http://www.cmgs.org/BPGs/pdfs%20current%20bpgs/UV%20GUIDELINES%20ratified.pdf ) list the following lines of evidence that might be used to assess pathogenicity:
2.1 Locus-specific databases. According to the UV guidelines, LSDBs should contain accurate (curated), clearly referenced data naming variants at the DNA, RNA and protein level and include all relevant comments relating to the clinical interpretation of the variant. It is considered ideal if LSDBs allow for repeated submissions of the same variant, from different individuals, rather than recording each variant only once.
2.2 Testing matched controls. This involves comparing the frequency of a variant in affected individuals with its frequency in a healthy control group. For example, Górski et al (2005) compared the frequency of the BRCA2 C5972T variant in 3,241 cases of breast cancer diagnosed at under 51 years of age with that in 2,791 ethnically matched controls. The authors state that this variant predisposes individuals to breast cancer. Comparing different histologic subgroups, they found that the effect was most pronounced in women who had ductal carcinoma in situ (DCIS) with micro-invasion (odds ratio = 2.8; p<0.0001). From discussions with clinical scientists it appears that although the authors describe C5972T as pathogenic, this piece of evidence would be treated “with caution” by a diagnostic laboratory. The odds ratio of 2.8 is fairly low and the DCIS phenotype is considered fairly benign. The UV guidelines stress the importance of considering patient ethnicity when using matched controls. The occurrence of many SNPs appears to vary across populations according to country of origin (for example, see Sven Bergmann’s analysis of the CoLaus cohort presented at ESHG 2010, https://secure.medacad.org/eshg.org/fileadmin/www.eshg.org/abstracts/ESHG2010Abstracts.pdf).
2.3 Co-occurrence in trans with known deleterious mutations. This may be useful in the analysis of BRCA1 variants. There is evidence that homozygosity and compound heterozygosity for BRCA1 pathogenic mutations are embryonically lethal. Therefore, an unclassified variant which is in trans with a known pathogenic mutation in a patient may be classified as non-pathogenic. Determining whether a variant occurs in trans with a pathogenic mutation may require sequencing of DNA from parents.
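The case-control comparison described in section 2.2 comes down to an odds ratio computed from a 2×2 table of carriers and non-carriers. The sketch below is illustrative only: the function name is hypothetical, the confidence interval uses the standard log (Woolf) approximation, and the carrier counts in the usage example are invented, because the text reports only the group sizes (3,241 cases and 2,791 controls), not the number of carriers in each group.

```python
import math


def odds_ratio(case_carriers, case_total, control_carriers, control_total):
    """Return (odds ratio, lower 95% bound, upper 95% bound) for a 2x2 table.

    The interval uses the log (Woolf) approximation, which needs all four
    cells of the table to be non-zero.
    """
    a = case_carriers                      # cases carrying the variant
    b = case_total - case_carriers         # cases without the variant
    c = control_carriers                   # controls carrying the variant
    d = control_total - control_carriers   # controls without the variant
    if min(a, b, c, d) <= 0:
        raise ValueError("all four cells of the 2x2 table must be positive")
    ratio = (a * d) / (b * c)
    standard_error = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(ratio) - 1.96 * standard_error)
    upper = math.exp(math.log(ratio) + 1.96 * standard_error)
    return ratio, lower, upper


if __name__ == "__main__":
    # Hypothetical carrier counts; only the group sizes come from the text.
    ratio, lower, upper = odds_ratio(200, 3241, 120, 2791)
    print(f"OR = {ratio:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval whose lower bound sits close to 1 is one reason a laboratory might treat a modest odds ratio with caution.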
2.4 Co-segregation with the disease in the family. This approach is most useful as a means of excluding pathogenicity in cases where a variant does not segregate with a given disorder. However, the penetrance of the variant could be an issue. Also, an unclassified variant may only appear to be pathogenic because it is in cis with an unidentified pathogenic mutation.
2.5 Occurrence of a new variant concurrent with the (sporadic) incidence of the disease. The de novo occurrence of a variant in a strong candidate disease gene concurrent with the sporadic incidence of the disease could be considered as strong evidence of pathogenicity. There are other factors which need to be considered: does the mutation affect mRNA splicing or amino acid sequence? Is it possible that what appears to be a de novo deletion may in fact be derived from a parent who is heterozygous for the deletion with the second chromosome carrying a duplication (e.g. spinal muscular atrophy)? Might non-paternity be an issue?
2.6 In silico predictions. Diagnostic laboratories use software tools to determine the likely functional consequences of missense mutations and identify changes to mRNA splicing patterns. Predictions based upon these software tools are considered to be acceptable in order to gain insight into the possible consequences of sequence variations but it is unacceptable to assign pathogenicity based solely on these results.
2.7 RNA Studies. RNA studies are regarded as the best means for gaining a definitive interpretation of putative splicing mutations. However, there are several limitations: the studies can be time consuming, not all laboratories have the facilities to perform these analyses and limited expression patterns may mean that the required tissue is not available for analysis.
2.8 Functional Studies. Protein functional studies provide a useful means of assessing the consequences of amino acid substitutions. However, such tests are only available for a small subset of genes. In addition, proteins may contain multiple functional domains and have multiple functions. Therefore, it should be remembered that if a test indicates that an amino acid substitution does not affect protein function, this does not exclude the possibility that the same substitution might affect other functions of the protein.
2.9 Loss of Heterozygosity. Loss of heterozygosity (LOH) occurs in an individual with a germline mutation when the remaining functional allele in a somatic cell becomes inactivated by mutation. For example, in hereditary retinoblastoma, a child inherits from one parent a copy of the RB1 gene that carries a pathogenic change. Most cells will have a functional second copy but chance loss of heterozygosity events (somatic mutations) in individual cells lead to development of retinal cancer. Therefore, hereditary retinoblastoma indicates the presence of a germline mutation. The UV guidelines state that it is acceptable to use LOH to assist in the prediction of pathogenicity of variants in tumour suppressor genes; however, this evidence is unlikely to be convincing in the absence of other lines of evidence.
2.10 Presence or absence in SNP Databases. If a variant is present in an unaffected individual this may be taken as evidence that it is a benign polymorphism rather than a pathogenic variant. Databases such as dbSNP store the normal variants in a gene and can be used to help decide whether a variant is benign. However, dbSNP entries often lack frequency data and in some instances the contents of LSDBs have been transferred to dbSNP. Therefore, although it is considered essential that dbSNP is searched, the presence of a variant in dbSNP should not be used as sole evidence that it is non-pathogenic in the absence of convincing frequency information. In the future, data from the 1000 Genomes Project will also be incorporated into this type of analysis.
2.11 Integrating lines of evidence. Bayesian methods for integrating lines of evidence to assess the likelihood of variant pathogenicity have been proposed (e.g. Goldgar et al., 2008). However, such statistical tools are not yet available to diagnostic labs. Instead, clinical scientists rank the credibility of different lines of evidence. The most credible lines of evidence are from journal publications; the least credible from bioinformatic analysis (see Figure 1).
Figure 1. Strategy for assessing lines of G2P evidence of variant pathogenicity. This figure has been taken from a training course for clinical scientists.
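The Bayesian integration mentioned in section 2.11 can be illustrated with Bayes' rule in odds form: the prior odds of pathogenicity are multiplied by one likelihood ratio per line of evidence. This is a minimal sketch rather than the published method; it assumes the lines of evidence are independent, the function name is hypothetical, and the prior and likelihood ratios in the usage example are invented for illustration.

```python
from math import prod


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability of pathogenicity with likelihood ratios.

    Each likelihood ratio is P(evidence | pathogenic) / P(evidence | neutral).
    The lines of evidence are assumed to be independent, so their ratios
    are simply multiplied together.
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must lie strictly between 0 and 1")
    if any(ratio <= 0 for ratio in likelihood_ratios):
        raise ValueError("likelihood ratios must be positive")
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1.0 + posterior_odds)


if __name__ == "__main__":
    # Hypothetical values: a prior of 0.1 and three lines of evidence
    # (for example co-segregation, co-occurrence and family history).
    evidence = [12.0, 0.8, 4.5]
    print(f"posterior = {posterior_probability(0.1, evidence):.3f}")
```

Working in odds keeps each line of evidence as a single multiplicative factor, so a ratio below 1 (evidence against pathogenicity) lowers the posterior in the same way that a ratio above 1 raises it.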
2.12 Classifying variants. After integrating all available lines of evidence the clinical scientist will classify a variant. Classification strategies may not be the same in all testing laboratories, but a typical set of classes might be:
Class 1 – Certainly not pathogenic
Class 2 – Unlikely to be pathogenic but cannot be formally proven
Class 3 – Unable to alter classification
Class 4 – Likely to be pathogenic but cannot be formally proven
Class 5 – Certainly pathogenic
2.13 Reporting variants. The UV guidelines state that it is essential that laboratories have mechanisms in place to submit results to existing databases (especially LSDBs). In addition, it is essential that laboratories issue an updated clinical report as new information becomes available to them (i.e. reports should be re-issued when a UV is reclassified as clearly pathogenic or as not pathogenic).
3 Bioinformatics resources for clinical scientists
As discussed in section 2, clinical scientists make use of databases and sequence analysis tools to determine variant pathogenicity. The following databases are available for the analysis of BRCA1 and BRCA2 variants.
3.1 Locus specific databases.
LSDBs have been defined as ‘‘a collection of sequence variants in a specific gene that causes a Mendelian disorder or change in phenotype’’ (Cotton et al., 2008). There is an underlying assumption that an LSDB is curated by an expert(s) in that gene, and that this expertise is reflected in the curation of variant data and distinguishes LSDBs from other variant databases. LSDBs may be used by clinical scientists to help evaluate variant pathogenicity but this may not be their only use. Greenblatt et al (2008) list a further five possible uses, including providing a catalogue of variants in a gene, providing details of DNA variation for in vitro assays and detailing ethnic and geographic variation of genetic variants. Because of this, different databases may classify the same variant differently and conclusions may or may not be supported by sufficient reliable data (Greenblatt et al, 2008).
There are currently no official standards describing the contents of an LSDB. However, recommendations have been made that would make LSDBs more useful for the classification of variants. Greenblatt et al (2008) made the following five recommendations:
1. LSDBs should only report a conclusion related to pathogenicity if a consensus has been reached by an expert panel. The panel should represent different areas of expertise (clinical, diagnostic, molecular, and computational).
2. The system used to classify variants should be standardized, using the five class IARC (International Agency for Research on Cancer) system developed by Plon et al (2008):
| Class | Pathogenicity | Posterior Probability |
|---|---|---|
| 5 | Pathogenic | >99% |
| 4 | Likely pathogenic | 95-99% |
| 3 | Uncertain | 5-95% |
| 2 | Likely neutral | 0.1-5% |
| 1 | Neutral, no clinical significance | <0.1% |
3. Evidence that supports a conclusion should be reported in the database, including sources and criteria used for assignment.
4. Variants should only be classified as pathogenic if more than one type of evidence has been considered.
5. All instances of all variants should be recorded.
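The posterior probability bands of the five class IARC system in recommendation 2 translate directly into a lookup. The sketch below is one possible reading of those bands; the function name is hypothetical, and because the quoted ranges share their end points (95% and 5% each appear in two bands), the choice to assign a boundary value to the higher class is an assumption made here, not something stated in the text.

```python
def iarc_class(posterior):
    """Map a posterior probability of pathogenicity (0 to 1) to an IARC class.

    Bands follow the table quoted from Plon et al (2008). Values that fall
    exactly on a shared boundary (0.95, 0.05, 0.001) are assigned to the
    higher class; that tie-breaking rule is an assumption of this sketch.
    """
    if not 0.0 <= posterior <= 1.0:
        raise ValueError("posterior must lie between 0 and 1")
    if posterior > 0.99:
        return 5  # Pathogenic
    if posterior >= 0.95:
        return 4  # Likely pathogenic
    if posterior >= 0.05:
        return 3  # Uncertain
    if posterior >= 0.001:
        return 2  # Likely neutral
    return 1      # Neutral, no clinical significance


if __name__ == "__main__":
    for value in (0.995, 0.97, 0.5, 0.01, 0.0005):
        print(value, "-> class", iarc_class(value))
```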
3.2 Databases for the analysis of BRCA1 and BRCA2 variants
The following locus specific databases are available for BRCA1 and BRCA2 genes.
3.2.1 UMD databases. UMD databases exist for both BRCA1 (http://www.umd.be/BRCA1/) and BRCA2 (http://www.umd.be/BRCA2/) variants. These are private, password-protected databases. They store variants from a network of 16 French diagnostic laboratories. The BRCA1 database currently contains 4,222 entries for 1,143 different mutations, while the BRCA2 database contains 4,972 entries for 1,513 mutations.
3.2.2 LOVD databases. There are several LOVD databases available.
BRCA1/2 Publication databases (http://chromium.liacs.nl/LOVD2/cancer/home.php). These databases store information on BRCA1 and BRCA2 taken from journal articles. The BRCA1 database currently contains 1,465 entries on 502 unique variants, which are taken from 125 publications. The BRCA2 database contains 934 entries on 487 unique variants, which are taken from 60 publications. Both databases are fairly up to date, having been last updated in November 2010. The database content is shown in Figure 2.
Figure 2: Example of variant data in BRCA2 publication database.
LOVD 2.0 allows two descriptions of variant pathogenicity, one from the submitter and one from the curator. The database does not use the 1-5 description of pathogenicity. Instead it uses the following:
| Symbol | Meaning |
|---|---|
| - | No known pathogenicity |
| -? | Probably no pathogenicity |
| ? | Unknown |
| +? | Probably pathogenic |
| + | Pathogenic |
However, in these LOVD databases (and in many others) only one classification is given. The second description is always left as “?”.
BRCA1 classification databases (http://brca.iarc.fr/LOVD/home.php?select_db=BRCA1). These databases store classifications of previously unclassified BRCA1 variants obtained using a bioinformatics based approach (Tavtigian et al., 2008). Currently there are 112 such classifications in the database. As shown in Figure 3, this database uses the same description of pathogenicity, but also stores the IARC classification in another column. Note that the two are not in agreement: c.65T>C is described as both “?/?” (i.e. it is considered to be unknown by both the submitter and curator) and “5-Definitely pathogenic”.
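The disagreement noted above for c.65T>C is the kind of inconsistency that a simple check over exported records could flag. The sketch below is illustrative only: the function names are hypothetical, the "-" symbol is assumed to pair with "No known pathogenicity", and the one-to-one correspondence between LOVD symbols and IARC classes is an assumption made for this example rather than a rule from either database.

```python
# LOVD 2.0 pathogenicity symbols and their meanings.
LOVD_CODES = {
    "-": "No known pathogenicity",
    "-?": "Probably no pathogenicity",
    "?": "Unknown",
    "+?": "Probably pathogenic",
    "+": "Pathogenic",
}

# Assumed correspondence between IARC class and LOVD symbol (illustration only).
IARC_TO_LOVD = {5: "+", 4: "+?", 3: "?", 2: "-?", 1: "-"}


def parse_lovd_pathogenicity(field):
    """Split a 'submitter/curator' field such as '+/?' into two meanings."""
    codes = field.strip().split("/")
    if len(codes) != 2 or any(code not in LOVD_CODES for code in codes):
        raise ValueError(f"unrecognised pathogenicity field: {field!r}")
    return tuple(LOVD_CODES[code] for code in codes)


def is_consistent(field, iarc_label):
    """Check whether either LOVD description matches the stored IARC class.

    `iarc_label` is expected to start with the class number followed by a
    hyphen, as in '5-Definitely pathogenic'.
    """
    parse_lovd_pathogenicity(field)  # validates the field
    iarc_number = int(iarc_label.split("-", 1)[0])
    return IARC_TO_LOVD[iarc_number] in field.strip().split("/")


if __name__ == "__main__":
    print(parse_lovd_pathogenicity("?/?"))
    print(is_consistent("?/?", "5-Definitely pathogenic"))
```

Accepting a match on either the submitter or the curator symbol reflects the observation that the second description is usually left as “?”.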
Figure 3: Example of variant data in BRCA1 classification database.
Zhejiang University databases (http://www.chinahvp.org/LOVD/home.php?select_db=BRCA1). These databases do not currently appear to be working, or the URL has changed.
Australian Human Variome Databases (https://australianhumanvariomedatabase.arcs.org.au/). This site lists LOVDs for several cancer related genes, including BRCA1 and BRCA2, but at present none contain any variants.
Fanconi Anaemia Mutation Database (http://chromium.liacs.nl/LOVD2/FANC/home.php?select_db=FANCD1). This database contains data for genes associated with Fanconi anaemia (FA). There are at least 13 genes associated with FA, including FANCD1 (BRCA2). The database contains 57 entries on 35 variants. All the variants listed are described as pathogenic. Some data from the BIC database (see 3.2.3) has been reproduced in this database (see Figure 4).
Figure 4: Example of BRCA2 variants in the Fanconi Anaemia Mutation Database.
3.2.3 Breast Cancer Information Core (BIC) database (http://research.nhgri.nih.gov/bic). BIC (Szabo et al., 2000) is a password protected database that stores variant data on BRCA1 and BRCA2 genes. It differs from the other LSDBs in that the pathogenicity of variants is decided by a database committee rather than by the data submitters. Therefore, although the database contains multiple instances of the same variant identified in multiple individuals, the pathogenicity of each instance will be the same. This policy has been in effect since 2006. Before then the effect of a variant was reported by the submitter. The change in policy reflects the concern that some BIC entries contained inappropriate conclusions. BIC now classifies the clinical importance of variants as ‘‘yes’’, ‘‘no’’ or ‘‘unknown’’.
Unlike the LOVD publication database, BIC does contain unpublished data, much of it from clinical tests. The majority of this data (8826 of 12016 BRCA1 entries and 9891 of 11331 BRCA2 entries) has been supplied by Myriad (http://www.myriadtests.com). BIC does not provide details of when it was last updated, but the date of creation for each individual entry is available. As Figure 5 shows, although the database grew steadily between 1997 and 2004, there are very few entries since 2005 and none since 2008.
BIC users are able to download the complete BRCA1 and BRCA2 variant lists as tab-delimited spreadsheets. However, the variants are not named according to the HGVS guidelines. Instead, information on nucleotide position (numbered from the start of the reference sequence and not the start codon) and type of variation are in different columns in the spreadsheet. In addition, variations described as insertions in BIC appear to have been wrongly named when tested with Mutalyzer, which identifies these variations as duplications.
Figure 5: Growth in the size of the BIC BRCA1 and BRCA2 databases. The database entries refer to the total number of variants, not the total number of unique variants.
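The two naming problems described for the BIC download, numbering from the start of the reference sequence and insertions that are really duplications, can be sketched as follows. This is not how Mutalyzer is implemented; the function names, the `cds_start` value and the toy sequences are hypothetical, and the duplication check only applies the HGVS convention of shifting a variant to its most 3' position before naming it.

```python
def bic_to_cdna_position(bic_position, cds_start):
    """Convert a position counted from the start of the reference sequence
    into coding DNA numbering, where the A of the start codon is position 1.

    `cds_start` is the 1-based position of that A in the reference sequence
    and has to be taken from the reference sequence record.
    """
    if bic_position < cds_start:
        raise ValueError("positions upstream of the start codon are not handled")
    return bic_position - cds_start + 1


def describe_insertion(reference, position, inserted):
    """Name an insertion of `inserted` after 1-based `position`.

    The insertion is first shifted as far 3' as the sequence allows; if the
    inserted bases then repeat the bases immediately before them, the change
    is reported as a duplication instead of an insertion.
    """
    if not inserted or not 1 <= position <= len(reference):
        raise ValueError("need a non-empty insertion at a valid position")
    # Shift right while the next reference base equals the first inserted base.
    while position < len(reference) and reference[position] == inserted[0]:
        inserted = inserted[1:] + inserted[0]
        position += 1
    start = position - len(inserted)
    if start >= 0 and reference[start:position] == inserted:
        if len(inserted) == 1:
            return f"{position}dup"
        return f"{start + 1}_{position}dup"
    return f"{position}_{position + 1}ins{inserted}"


if __name__ == "__main__":
    print(bic_to_cdna_position(150, cds_start=101))  # hypothetical offset
    print(describe_insertion("GATTACA", 2, "T"))     # toy sequence
    print(describe_insertion("ACGCGT", 3, "GC"))     # toy sequence
```

In the first toy case the inserted T sits next to an existing run of T, so after shifting it is named as a duplication of position 4; in the second the inserted bases do not repeat their neighbours and the change stays an insertion.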
© Copyright 2011 GEN2PHEN Consortium
3.2.4 Diagnostic Mutation Database (DMuDB). (http://www.ngrl.org.uk/Manchester/projects/informatics/dmudb). DMuDB was initiated by Graham Taylor at the University of Leeds and has been developed by the National Genetics Reference Laboratory Manchester to store the results of diagnostic tests, including BRCA1 and BRCA2, for UK laboratories. The database differs from the LSDBs discussed above because it is patient referral centred rather than locus centred. The database currently holds 12,076 referrals, which contain more than 37,000 individual variants in 50 genes. There are reports for approximately 30,000 BRCA variants. Access is currently restricted to staff at UK laboratories. The database provides a mechanism for clinical scientists to store and share data that would otherwise remain unpublished. Users submit their data to the database with the understanding that they retain ownership of the data and can control the extent to which it is shared with other users. In addition, the pathogenicity of variants is decided by the submitter, rather than by a committee as in BIC. However, because UK labs work to similar standard operating procedures (SOPs), there appears to be a high degree of consistency between reports of the same variant from different UK laboratories.
3.2.5 Human Gene Mutation Database (HGMD). (http://www.hgmd.cf.ac.uk/ac/index.php). HGMD has been developed by the Institute of Medical Genetics in Cardiff and Biobase. The database is password protected but access is not restricted to the UK. There is a free version of the database and a more up-to-date version available to paying subscribers. It covers many genes (3,960 in the latest Professional release), including BRCA1 and BRCA2. HGMD is a database of published variants. It provides a link to the first published article on a variant, plus subsequent publications if they enhance the original report (Professional version). In some cases this could be as long ago as 1994. HGMD does not use the HGVS nomenclature to describe variants. For missense changes it lists the codon position, the old and new codons and the amino acid change (see Figure 6a). For small (≤ 20 bp) deletions, the deleted bases are shown in lower case and the flanking 10 bp in upper case. The numbered codon (if the deletion is in an exon) is indicated with a caret (^) (see Figure 6b). In addition, HGMD does not associate a pathogenicity score with the variants but describes the phenotype associated with the variant.
Figure 6: Examples of variants in the HGMD database showing alternative nomenclature for variants: (a) a missense change; (b) a small deletion.
Some HGMD data have been incorporated into Ensembl. Figure 7 shows the Ensembl view of HGMD entry CM034005 (codon 1458 changing from CAG to TAG), which creates a stop codon and is listed as causing breast cancer in HGMD. Ensembl shows the position of the variant but does not give any other details about the type of change or the phenotype.
Figure 7: Integration of HGMD data into Ensembl.
HGMD entries are also stored in the HGVbaseG2P database (see Section 8.3). The variants cannot be searched for but are visible in the HGVbaseG2P genome browser (see Figure 8).
Figure 8: Integration of HGMD data into HGVbaseG2P.
3.2.6 Single Nucleotide Polymorphism Database (dbSNP). (http://www.ncbi.nlm.nih.gov/projects/SNP/). dbSNP is a publicly accessible repository of genetic variation. As discussed in section 2.10, it is used by clinical scientists to determine whether variants are benign polymorphisms. Variants from LSDBs have been incorporated into dbSNP. These are described as having “clinical association” and details of the source of the data can be viewed by selecting the VariationView option. Figure 9 shows the submissions for rs55968715 from the Ostrander Lab at the NIH and Lawrence Brody at the NHGRI (this is the BIC database). The table indicates that there is a third submission for this SNP, which is not currently listed. This came from Amanda Spurdle at Queensland Institute of Medical Research.
Figure 9: Use of dbSNP Variation Viewer to display BRCA2 clinically associated variants.
3.3 Bioinformatics tools
Clinical scientists use two types of bioinformatics tools to analyse variant pathogenicity. These are missense analysis tools (typically AlignGVGD, SIFT and Polyphen), which estimate the effect of an amino acid change caused by a missense mutation, and splicing tools (including Fruitfly, NetGene2 and Human Splicing Finder), which estimate the effect of a mutation on the splicing of an mRNA sequence. As discussed in section 2.6, it is not considered acceptable for a variant to be described as pathogenic or non-pathogenic solely on the basis of bioinformatics analysis. These tools may not have been developed for clinical use and the accuracy of the results obtained has not been adequately assessed. Because of the uncertainty regarding the accuracy of these tools, clinical scientists will analyse the same variant using several tools and produce an “aggregate result”. To automate this repetitive and time-consuming task, Alamut analyses a variant with multiple tools at once and integrates the results into a single view.
Issues
• The Unclassified Variants guidelines state that it is essential that laboratories have mechanisms in place to submit results to LSDBs, and in the past laboratories have been criticised for failing to do so. However, in the case of BRCA1 and BRCA2 variants it is not clear which LSDB they could use. The LOVD databases are intended for published data, or for previously unclassified variants, or for a different phenotype, or are not currently in use or accessible. The BIC database does contain clinical data but is not currently being updated, while the UMD databases store data for a particular country and have restricted access. This is also true of DMuDB, although this policy may be about to change.
• DMuDB does not follow the usual model for an LSDB in that it does not have curators who are experts on particular genes. Instead the submitters are experts in the sequencing and analysis of variants. Their analyses will be performed to an SOP and will be verified by other experts before submission. Therefore DMuDB has moved from an expert curator to an expert submitter model. The curators’ expertise is in building and maintaining the database. This model does not appear to have restricted the growth of DMuDB; indeed, it has the most BRCA variants of the databases considered here.
• The reporting of pathogenicity varies between LSDBs. The IARC values are only used in one database (the BRCA1 classification databases). It may limit the usefulness of integrating different LSDBs if the clinical interpretation of results cannot also be integrated.
• The use of the LOVD variant classification system allows curator and submitter to provide their classification of pathogenicity. However, in the case of the BRCA1 and BRCA2 publication databases (and other LOVD databases) only one classification is given. This could cause problems when integrating, since the classification “+/?” could mean “pathogenic/uncertain” or “pathogenic/no opinion given”.
• The lack of correct use of HGVS variant names in some LSDBs, such as BIC and HGMD, will complicate the process of data integration.
• The integration of clinical variants into dbSNP does not include giving any details of the clinical significance of the variants. Are these pathogenic variants, or benign polymorphisms that have been included in an LSDB?
• It is not clear whether the submissions of clinical variants to dbSNP are independent. For example, the Ostrander Lab has submitted data to BIC in the past. It might be the case that the same identification of a variant is submitted more than once: from the lab that generated the variants and from the LSDB. This could be problematic if these data are used to estimate the frequency of occurrence of these variants.
• Although clinical scientists are using multiple tools to evaluate pathogenicity, there are many more tools which do not appear to be used. Lists of tools are available at the GEN2PHEN Knowledge Centre (http://www.gen2phen.org/content/functionalprediction) and the PONP (Pathogenic or not pipeline) website (http://bioinf.uta.fi/PON-P/). It may be that these tools generate more reliable results than those currently in use. NGRL Manchester is currently conducting an assessment of clinical bioinformatics software and may recommend that other tools are adopted.
4 Examples of variant analysis by clinical scientists.
The following examples involve real variants which have been analysed by diagnostic laboratories. These examples are used in training courses organised by NGRL Manchester in order to demonstrate how bioinformatics resources should be used and information integrated.
4.1 BRCA1 U14680.1 c.211A>G. This variant causes an Arg to Gly substitution at position 71. The lines of evidence for this variant are:
4.1.1 In silico predictions
• Multiple sequence alignment of BRCA1 orthologs shows that Arg71 is a highly conserved residue.
• The change of Arg to Gly has a moderate Grantham distance (125).
• SIFT predicts that the change from Arg to Gly will affect protein function.
• The 211A>G change is in the -2 position of a 5’ exon splice site. The change affects a predicted exonic splice enhancer (ESE) hexamer. However, the change is not in the +1 or +2 position and therefore could not be predicted to destroy a splice site. In addition, ESEs are frequently predicted to occur within exonic sequences: RESCUE-ESE (http://genes.mit.edu/burgelab/rescue-ese/) predicts 332 ESEs within U14680.1.
4.1.2 Database searches
• The R71G change has been reported 36 times in the BIC database. The BIC steering group has classified the change as clinically significant.
• The BRCA1 LOVD database contains 5 entries for publications describing R71G (see Figure 10). Two list the variant as pathogenic (+/?) and three list it as non-pathogenic (-/?).
Figure 10: BRCA1 publication database showing c.211A>G submissions.
• This variant is not reported in dbSNP. (Since this section was written, the variant has been included in build 132 of dbSNP; it has been submitted via BIC.)
• This variant is reported in HGMD. The paper cited (Diez et al 1999) describes several variants identified in 83 Spanish breast cancer/ovarian cancer families and lists this variant among the “missense mutations of unknown significance”.
4.1.3 Publications
Searching Pubmed using “BRCA1 c.211A>G” identifies one paper (Santos et al., 2009), while searching using “BRCA1 R71G” identifies the Santos et al (2009) and Diez et al (1999) papers plus a third paper by Vega et al (2001). Neither search identified the Pettigrew et al (2005), Ruffner et al (2001) or Morris et al (2006) papers listed in LOVD.
Searching using Google Scholar identified rather more papers: 64 for BRCA1 R71G and seven for BRCA1 c.211A>G. No hits were obtained using the full HGVS nomenclature for this variant.
The papers identified in Pubmed and LOVD give conflicting reports on the pathogenicity of this variant. Ruffner et al (2001) use an enzyme assay that measures E3 ubiquitin ligase activity. Their results show that although the R71G change occurs within the BRCA1 RING domain, it does not appear to affect enzyme function. The Morris et al (2006) paper also describes an enzyme assay and considers the position of the changed residue within the 3D structure of the protein. Their results also indicate that R71G does not affect enzyme function.
The paper by Pettigrew et al (2005) is a comparative analysis of predicted ESE sites in BRCA1 orthologs. It is not certain why this paper is reported in LOVD as describing the 211A>G variant as pathogenic. The paper does not refer to the 211A>G variant, although it does contain data on 330A>G, which may be the same variant (there is a 119 bp 5’ UTR; adding this to 211 would place the variant at position 330). However, the paper lists this variant as increasing the ESE (exonic splicing enhancer) motif score.
The crucial piece of evidence in determining the pathogenicity of this variant is provided by Vega et al (2001). They demonstrate that 211A>G is responsible for aberrant splicing of the BRCA1 transcript. The authors demonstrate this using RT-PCR on mRNA derived from peripheral blood cells. Two PCR products were identified, corresponding to the unaffected and 211A>G alleles. Sequencing of the 211A>G PCR product showed that 22 bp of exon 5 were deleted, creating a new stop codon within exon 6. It is interesting to note that again the authors did not refer to this variant as 211A>G but as 330A>G. As well as identifying the effect on splicing, the authors also observed the co-segregation of the mutation with the disease in a large family. The paper by Santos et al (2009) also shows the effect of 211A>G on splicing.
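The literature searches above had to be repeated with several names for the same variant. As a minimal sketch, the alternative names can be generated from the coding position, assuming (as the text suggests) that the older numbering simply adds the 119 bp 5’ UTR of U14680.1 to the position counted from the start codon. The function names are hypothetical, and the arithmetic only applies to positions within the coding sequence.

```python
UTR5_LENGTH = 119  # 5' UTR length quoted in the text for U14680.1


def cdna_to_legacy(c_pos: int, utr5: int = UTR5_LENGTH) -> int:
    """Convert a coding position (A of ATG = 1) to a position numbered
    from the start of the reference sequence."""
    return c_pos + utr5


def codon_number(c_pos: int) -> int:
    """Codon containing a coding position (positions 1-3 are codon 1)."""
    return (c_pos + 2) // 3


def search_aliases(gene, c_pos, ref, alt, aa_ref, aa_alt):
    """List the names under which a substitution might appear in the
    literature: HGVS-style, legacy numbering, and protein shorthand."""
    return [
        f"{gene} c.{c_pos}{ref}>{alt}",
        f"{gene} {cdna_to_legacy(c_pos)}{ref}>{alt}",
        f"{gene} {aa_ref}{codon_number(c_pos)}{aa_alt}",
    ]


print(search_aliases("BRCA1", 211, "A", "G", "R", "G"))
# ['BRCA1 c.211A>G', 'BRCA1 330A>G', 'BRCA1 R71G']
```

Running each alias as a separate query reduces the chance that a key paper, such as one using the 330A>G name, is missed.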
4.1.4 Classification. Because of the effect on splicing observed by Vega et al (2001) and Santos et al (2009), this variant has been classified as Class 5, certainly pathogenic.
4.2 BRCA1 U14680.1 c.1067A>G. This variant causes a Gln to Arg substitution at position 356. The lines of evidence for this variant are:
4.2.1 In silico predictions
• Multiple sequence alignment of BRCA1 orthologs shows that Gln356 is a moderately conserved amino acid.
• The change from Gln to Arg has a small Grantham distance (43).
• SIFT predicts that the change is tolerated.
• Polyphen predicts that the change is probably damaging.
• Polyphen 2 predicts that the change is probably damaging (score = 0.977).
• Align GVGD predicts that the change is tolerated.
• Splice site analysis tools predict a minor effect on splicing. The variant could introduce new ESE hexamers.
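The “aggregate result” that clinical scientists assemble from several tools can be sketched as a simple tally. The example below uses the four missense predictions listed above for c.1067A>G; the function name and the output wording are hypothetical, and the sketch only reports agreement or disagreement. It does not weight the tools or produce a classification.

```python
from collections import Counter
from typing import Dict


def aggregate_predictions(predictions: Dict[str, str]) -> str:
    """Summarise per-tool predictions as concordant or discordant."""
    counts = Counter(predictions.values())
    if len(counts) == 1:
        # Every tool returned the same label.
        return "concordant: " + next(iter(counts))
    # Tools disagree: list each label with the number of tools reporting it.
    parts = [f"{label} x{n}" for label, n in sorted(counts.items())]
    return "discordant: " + ", ".join(parts)


# Predictions for BRCA1 c.1067A>G (Q356R) as listed in section 4.2.1.
q356r = {
    "SIFT": "tolerated",
    "Polyphen": "probably damaging",
    "Polyphen 2": "probably damaging",
    "Align GVGD": "tolerated",
}
print(aggregate_predictions(q356r))
# discordant: probably damaging x2, tolerated x2
```

For this variant the tally is evenly split, which illustrates why in silico predictions alone are not treated as sufficient evidence.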
4.2.2 Databases
• The variant is reported in BIC 82 times and described as “clinical importance unknown”.
• The variant is reported in dbSNP (rs1799950), allele frequency 0.031 (N = 495).
• HGMD lists a publication (Schoumacher et al, 2001) which gives no firm evidence for or against pathogenicity.
• The BRCA1 LOVD database contains thirteen entries for publications describing Q356R. Four describe the variant as pathogenic (+/?), four as non-pathogenic (-/?) and five as uncertain (?/?) (see Figure 11).
Figure 11: BRCA1 publication database showing c.1067A>G submissions.
4.2.3 Publications
Evidence from papers listed in BRCA1 LOVD and a further six papers identified in Pubmed (marked *) is listed in Table 1.

| Publication | Clinical effect | Evidence |
|---|---|---|
| Burk-Herrick et al. (2006) | ? | Analysed 154 mutations from BIC using an alignment based on 132 mammalian sequences and compared results obtained using SIFT with results obtained by Fleming et al. (2003). |
| Cox et al. (2005) | | Used whole-gene resequencing data to examine the association between BRCA1 SNPs and breast cancer using 1323 cases and 1910 controls. Observed homozygotes of the 356Arg allele in both cases and controls. |
| Diez et al. (2003) | ? | Identified the 1186A>G variant in 7% of Spanish breast/ovarian cancer patients. No comment regarding pathogenicity. |
| Greenman et al. (1998) | | Describe Q356R as a polymorphism because it does not meet their criteria for pathogenicity. These are: segregation with the disease; absence in ethnically matched controls; nonconservative amino acid substitutions; residue conserved in the murine and canine homologues of BRCA1; occurrence within a conserved and possibly functional motif. |
| Johnson et al. (2007) | + | Analysed 1037 non-synonymous SNPs in candidate cancer genes in 2463 controls and 473 breast cancer cases with two primary breast cancers. Of all the SNPs assessed in this study, Q356R had the highest odds ratio (1.72, minor allele frequency = 5.4%). |
| Lee et al. (2008) | + | Used an alignment-based method. Alignment for BRCA1 residues 225 to 1365 from 55 organisms. Their method indicates that this variant is deleterious, in agreement with Polyphen. |
| McKean-Cowdin et al. (2005) | ? | Identified BRCA1 variants in African-American and Latina women and compared variant frequencies in both populations. No evidence presented relating to Q356R pathogenicity. |
| Miki et al. (1994) | ? | Early paper identifying candidate breast cancer gene. No evidence specific to this variant, possibly due to renumbering. |
| Tavtigian et al. (2006) (listed as 2005 in LOVD) | | Described Q356R as neutral based upon co-occurrence with clearly deleterious mutations in BRCA1, and prediction by Align-GVGD. |
| Tomassi et al. (2008) | + | Assessed pathogenicity using multiple bioinformatic tools. All software used demonstrated a possible biological implication of Q356R BRCA1. |
| Auranen et al. (2005) * | - | Investigated whether polymorphisms in DNA double strand break repair genes are associated with epithelial ovarian cancer (EOC) risk. Study involved 1,600 cases and 4,241 controls from 4 separate genetic association studies from 3 countries. No association detected between EOC risk and Q356R. |
| Hadjisavvas et al. (2004) * | + | A pair of rare variants, Q356R and S1512I, was detected in BRCA1 in patients belonging to two Cypriot families. The simultaneous presence of this pair of missense mutations may be associated with the breast cancer phenotype in the Cypriot population. |
| Janezic et al. (1999) * | + | Study aimed to provide more accurate frequency estimates of breast cancer susceptibility gene 1 (BRCA1) germline alterations in the ovarian cancer population. The rare form of the Q356R polymorphism was significantly (P = 0.03) associated with a family history of ovarian cancer, suggesting that this polymorphism may influence ovarian cancer risk. |
| Menzel et al. (2004) * | - | Two case control studies in Tyrol and Prague. Did not identify any association between Q356R and breast cancer. |
| Seymour et al. (2008) * | | Complete sequencing of coding regions from 217 women from high-risk breast cancer families and 155 age-matched controls. Q356R did not show a significant risk association. |
| Wenham et al. (2003) * | | Case control study of ovarian cancer performed in North Carolina. Study involved 312 women with ovarian cancer and 401 age and race-matched controls. Q356R not associated with ovarian cancer risk. |
Table 1: Overview of publications describing the pathogenicity of the BRCA1 c.1067A>G variant.
4.2.4 Conclusion. This variant has been classified as Class 2 – Unlikely to be pathogenic but cannot be formally proven. The variant is thought likely to be a benign polymorphism on the basis of several large association studies. However, it is not possible to completely exclude minor effects.
4.3 Issues. As these two examples demonstrate, the job of the clinical scientist in integrating multiple lines of evidence can be challenging. There are several issues:
• There is no formal method for integrating variants and no definition of what is meant by pathogenic.
• Bioinformatics tools can produce contradictory answers.
• LSDBs may not provide a clear view of the likely pathogenicity of a variant.
• It is almost impossible for an LSDB curator to ensure that their database is up-to-date and that no key papers are missed, especially for genes such as BRCA1 and BRCA2 where there are large numbers of laboratories generating data. This task is made even more difficult by authors using alternative naming strategies to describe their variants.
• Clinical scientists may need to evaluate the evidence from many publications.
• Some published articles use much less rigorous standards in evaluating pathogenicity than clinical scientists. Some of the evidence used to assign pathogenicity in the publications discussed above would not be considered sufficient by diagnostic laboratories.
• These discrepancies are reproduced in LSDBs. For example, describing a variant as pathogenic on the basis of a Grantham score would not be acceptable for a diagnostic laboratory.
• There may be one critical piece of evidence which decides how a variant is classified. For example, in the analysis of c.211A>G the splicing analysis by Vega et al (2001) is key to determining that the variant is pathogenic. If this piece of evidence were not identified by the clinical scientist analysing this variant, they might decide that the variant was less likely to be pathogenic based on the enzyme assay developed by Ruffner et al (2001).
5 The ENIGMA consortium
The ENIGMA consortium has been established in order to classify BRCA1 and BRCA2 variants which are currently unclassified. The consortium will function by pooling information, for example on segregation within BRCA families or immunohistochemistry
results; and by selecting sets of variants for analysis, for example mRNA splicing analysis or enzyme assays. This is a similar approach to that used by clinical scientists. Multiple lines of evidence are combined in order to produce a classification, and the lines of evidence are largely the same as those that would be considered by a clinical scientist. However, the variants being considered are ones that clinical scientists have been unable to classify, often because there was not sufficient data available for them to develop a classification.
In order to store the multiple lines of evidence, ENIGMA is developing a relational database. This database will not be made accessible outside the consortium. Instead, ENIGMA will make classifications available both via publications and by submission to the BIC database. The process of requirements gathering for the ENIGMA database is still ongoing. However, from responses received to date it is clear that the consortium will require far more data to be stored than is usually the case for LSDBs.
The following are a selection of the proposed data fields for in vivo analysis of mRNA splicing:
• Forward Primer
• Reverse Primer
• RNA source - Cells / tissue (and collection method)
• Culture conditions
• Nonsense mediated decay inhibition
• RNA extraction method
• DNase I treatment
• RNA storage
• Amount of RNA used in cDNA synthesis
• cDNA synthesis primer
• cDNA synthesis protocol
• PCR Amplification
• PCR product analysis
• Aberration(s) Detected Experimentally - Qualitative data
• Aberration(s) Detected Experimentally - Quantitative Data - including allele-specific detection assays
• Aberration(s) at RNA level - HGVS nomenclature
• Aberration(s) at protein level - HGVS nomenclature
• Genomic Co-ordinates of Aberration(s)
• Level of full-length transcript produced by variant allele (%)
• Overall Level of full-length transcript (%)
• Methodological deficiencies
• Comment
• Qualitative 5-class IARC Splicing Classification (Spurdle et al., 2008)
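To give a sense of how such a record might be structured, the sketch below models a subset of the proposed fields as a single typed record with basic range checks. It is an illustration only: the class name, attribute names, types and validation rules are assumptions and do not describe the actual ENIGMA database design, whose requirements were still being gathered.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplicingAssayRecord:
    """Hypothetical record covering a subset of the proposed fields."""
    forward_primer: str
    reverse_primer: str
    rna_source: str                      # cells / tissue and collection method
    nmd_inhibition: bool                 # nonsense mediated decay inhibition used?
    rna_aberration_hgvs: Optional[str] = None
    protein_aberration_hgvs: Optional[str] = None
    full_length_transcript_pct: Optional[float] = None  # from variant allele
    iarc_splicing_class: Optional[int] = None            # 5-class scheme
    comment: str = ""

    def __post_init__(self) -> None:
        pct = self.full_length_transcript_pct
        if pct is not None and not 0.0 <= pct <= 100.0:
            raise ValueError("transcript level must be between 0 and 100 %")
        cls = self.iarc_splicing_class
        if cls is not None and cls not in range(1, 6):
            raise ValueError("splicing class must be between 1 and 5")


# Placeholder values, not real assay data.
record = SplicingAssayRecord(
    forward_primer="FORWARD_PRIMER_SEQUENCE",
    reverse_primer="REVERSE_PRIMER_SEQUENCE",
    rna_source="peripheral blood",
    nmd_inhibition=True,
    full_length_transcript_pct=40.0,
    iarc_splicing_class=5,
)
print(record.iarc_splicing_class)
```

Keeping the percentage and class fields constrained at entry makes later pooling of results from different laboratories less error-prone than free-text fields would.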
6 Role of next generation sequencing in clinical sequencing.
At present the genetic screening of breast cancer patients focuses on the sequencing of BRCA1 and BRCA2 genes. However, these are not the only genes associated with breast and ovarian cancer. Genes associated with other diseases such as BRIP1 and PALB2 (Fanconi
anemia), TP53 (Li-Fraumeni syndrome), PTEN (Cowden syndrome), STK11 (Peutz-Jeghers syndrome) and CDH1 (hereditary diffuse gastric cancer) have been associated with breast cancer (Seal et al., 2006; Rahman et al., 2007; Gonzalez et al., 2009; FitzGerald et al., 1998; Hearle et al., 2006; and Schrader et al., 2008), while genes responsible for HNPCC have been associated with ovarian cancer (Aarnio et al., 1999). Therefore sequencing only BRCA1 and BRCA2 may miss the underlying mutations responsible for breast or ovarian cancer. In addition, genetic testing for BRCA1 and BRCA2 mutations tends to be used for women where there is a family history of breast or ovarian cancer. However, in some cases breast cancer patients will not have a family history of breast cancer, despite an underlying germline mutation, because their mutation was paternally inherited, their family is small, and no other female family members inherited the mutation.
The recent advances in sequencing technologies could help to solve both these problems. Using next generation sequencing, it will be possible to sequence many genes simultaneously, and the speed and, in time, the reduced costs associated with these technologies may mean that more patients can be tested. An example of how next generation sequencing may be used is provided in a recent publication by Walsh et al. (2010). In this study 21 genes responsible for inherited risk of cancer were fully sequenced for 20 patients, and small variants and large deletions and duplications were identified and analysed. Clearly, if more genes are being sequenced in a larger number of patients, the amount of bioinformatics analysis required will increase. Part of the analysis method used by Walsh et al. (2010) is shown in Figure 12. Locus-specific databases and dbSNP were searched to establish whether a variant was known to be either pathogenic or a benign polymorphism. Candidate variants were checked to determine whether they were exonic, intronic or intergenic, and for the effect of the variation (nonsynonymous substitution, frameshift, splice site mutation etc).
Figure 12: Methodology for analysis of variants used by Walsh et al. (part of figure from Walsh et al., 2010).
As well as being used to identify genetic causes of breast cancer, next generation sequencing techniques have been used to identify the genes underlying rare Mendelian disorders. For example, Ng et al. (2010) identified DHODH as a candidate gene for Miller syndrome and Bilgüvar et al. (2010) identified WDR62 mutations in severe brain malformations. A feature of these studies is that they require the sequencing of only a few individuals. Ng et al. (2010) sequenced four individuals, including two siblings, while Bilgüvar et al. initially sequenced two individuals from a small consanguineous family and then identified WDR62 mutations in other individuals.
Whole exome sequencing will generate large numbers of variants. To avoid having to analyse thousands of variants, researchers have developed analysis methods which focus on filtering as many variants from further analysis as possible. For example, Ng et al. began by focussing only on non-synonymous variants, splice acceptor and donor site mutations and short insertions or deletions. The same variant had to be present in both siblings and a variant had to be present in the same gene in the other two kindreds. Common variants (present in dbSNP129 or HapMap8) were excluded. This reduced the number of candidate genes to nine, assuming a recessive model of inheritance for Miller syndrome. This type of methodology is suitable for a disorder such as Miller syndrome. Variants causing a rare disease such as Miller syndrome are unlikely to be present in dbSNP and, because the disease has a well defined phenotype, variants can be eliminated by comparing individuals. As well as identifying candidates for rare diseases, whole exome sequencing is being used to identify candidates for more frequently observed phenotypes.
Vissers et al (2010) investigated de novo mutations which might be responsible for mental retardation by sequencing eight parent-child trios. Because the authors were searching for de novo mutations, they were able to exclude variants identified in the children that were inherited from the unaffected parents. In addition, variants present in dbSNP were also excluded, as were nongenic, intronic and synonymous variants.
The extent to which NGS will be used in clinical genetics is still not clear. However, next generation sequencers are already being used for clinical sequencing. For example, sequencing of clinically important genes, including BRCA1 and BRCA2, is now being offered by NewGene (http://www.newgene.org.uk) using a Roche 454 sequencer. In this case an existing test is being offered using techniques which allow higher throughput and faster turnaround times. NGS will also allow an increase in the range of genetic tests available. For example, the Manchester Biomedical Research Centre is developing improved eye gene tests. To date, roughly 140 genes have been associated with eye disease. This is clearly too many to be sequenced in a conventional test; instead, a new method using targeted exon enrichment and sequencing using a SOLiD sequencer is being developed.
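The filtering strategies described earlier in this section for the exome studies (shared candidates across affected individuals, and de novo variants in trios) can be sketched as set operations. This is a simplified illustration, not the published pipelines: the `Variant` record, the consequence labels and the gene names are hypothetical, and the recessive-model step (requiring two hits per gene) is omitted.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Variant:
    gene: str            # gene symbol (placeholder names in the tests)
    name: str            # variant description
    consequence: str     # e.g. "nonsynonymous", "synonymous", "intronic"
    in_dbsnp: bool = False


# Consequence classes retained for further analysis (assumed labels).
KEPT = {"nonsynonymous", "splice_site", "indel"}


def candidates(variants: Iterable[Variant]) -> Set[Variant]:
    """Keep potentially functional variants that are not known in dbSNP."""
    return {v for v in variants if v.consequence in KEPT and not v.in_dbsnp}


def shared_candidate_genes(sib_a: Iterable[Variant],
                           sib_b: Iterable[Variant],
                           unrelated: List[Iterable[Variant]]) -> Set[str]:
    """Genes with the same candidate variant in both siblings and at
    least one candidate variant in every unrelated affected individual."""
    shared = candidates(sib_a) & candidates(sib_b)
    genes = {v.gene for v in shared}
    for person in unrelated:
        genes &= {v.gene for v in candidates(person)}
    return genes


def de_novo_candidates(child: Iterable[Variant],
                       mother: Iterable[Variant],
                       father: Iterable[Variant]) -> Set[Variant]:
    """Candidate variants in the child that are absent from both parents."""
    return candidates(child) - set(mother) - set(father)
```

Each step only removes variants, so the order of the filters does not change the final candidate set, although applying the cheapest filter first reduces the work done by later steps.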
As “third generation” sequencers become available there may be yet more changes to clinical sequencing. These methods, such as SMRT (single molecule, real time) sequencing, developed by Pacific Biosciences (http://www.pacificbiosciences.com), and label-free single molecule sequencing, developed by Oxford Nanopore Technologies (http://www.nanoporetech.com/), do not require PCR amplification of template DNA and should be faster than second generation technologies such as Illumina and SOLiD sequencers. These techniques could also be much less expensive than conventional sequencing: the “$1000 genome” has been suggested, but even cheaper genomes may be possible. GnuBio have suggested that a whole genome might be sequenced for $30 (http://fluidicmems.com/2010/06/03/gnubio-will-droplet-basedsequencing-from-the-weitz-lab-win-the-race). The development of these techniques may mean that clinical whole genome sequencing becomes a technical and financial possibility, even if the means to interpret the data is not in place.
As well as being cheaper and faster, third generation sequencing offers the prospect of new types of sequence data being generated. For example, the Pacific Biosciences SMRT sequencer has been used to directly detect DNA methylation without the need for bisulfite conversion (Flusberg et al., 2010). Other third generation methodologies such as nanopore sequencing will offer similar possibilities. This raises the possibility of new types of clinical sequencing experiments, such as identification of epigenetic modifications as a part of cancer diagnosis (Costa, 2010).
6.1 Issues
A diagnostic test involving the sequencing of many genes, such as the eye disease genes test described above, raises questions about how the variant data should be managed. One model might be that the variants identified in 140 genes would then be sent to 140 different LOVDs. However, it is unlikely that those generating the data would be happy with having to repeat the same task 140 times. Instead, it might be more reasonable to expect them to add their data to a single database and this database might then be used to distribute their data across LSDBs. For example data might be stored in a repository such as DMuDB. This repository then uses Café RouGE (http://www.caferouge.org) to inform curators of the variants. In order to deal with the large numbers of variants identified by NGS experiments, bioinformaticians will need to develop informatics pipelines to automate as much of the analysis as possible. A pipeline that analyses variants from 140 genes will ideally not have to include 140 different LSDBs, since the likelihood of part of the pipeline failing will increase with each web service added. In addition, there would be problems caused by the lack of standards for describing pathogenicity across different LSDBs.
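The point that a pipeline becomes more fragile with each web service added can be made concrete with a small calculation. The sketch below assumes that each service fails independently with the same probability, and the 1% failure rate is an invented figure used only for illustration; neither assumption comes from measurements of real LSDBs.

```python
def pipeline_failure_probability(per_service_failure, n_services):
    """Probability that at least one of n independent services fails,
    given the same failure probability for each individual call."""
    return 1.0 - (1.0 - per_service_failure) ** n_services


# Hypothetical 1% chance that any single LSDB web service is unavailable.
for n_services in (1, 10, 140):
    risk = pipeline_failure_probability(0.01, n_services)
    print(f"{n_services:>3} services: {risk:.3f}")
```

Under these assumptions a pipeline that calls 140 separate services fails roughly three times out of four, whereas one that calls a single repository fails once in a hundred runs.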
7 Genome-wide association studies.
GWA studies are a method of performing an association study without prior knowledge of which genes are likely to be involved (Need and Goldstein, 2010). They have been used in recent years to explore the relationship between common genetic variation and disease, biological characteristics and drug responses. Underpinning GWAS is the theory that common diseases are caused by common variants (Chakravarti, 1999; Reich and Lander, 2001). However, while GWA studies have generated hundreds of confirmed susceptibility factors, the variants that are identified are only responsible for a small fraction of the genetic variation (Goldstein, 2009). Because of this, these results are considered of little diagnostic utility by clinicians and are not incorporated in clinical tests. The reason for this may be that GWAS experiments analyse shared ancestral SNPs which have been maintained within the population for many generations and occur frequently (minor allele frequency > ~5%) in the population. It is argued that any variant that has survived within a population for a large number of generations cannot be highly pathogenic. There is some evidence for this argument, since most of the factors identified in GWA studies have odds ratios well below 1.5 (Read and Donnai, 2010). This has led to the development of an alternative to the “common disease, common variant” theory, which is that rare variants can play an important role in the development of common diseases (for example, see Cirulli and Goldstein, 2010). Another hypothesis is that while the effect of individual SNPs is small, the combined effect of having multiple SNPs may be much larger (for example, see comments by Kári Stefánsson in Guttmacher et al., 2010). If this is correct and combinations of common variants can be shown to have a significant effect on genetic diseases, then future genetic testing may incorporate GWAS-based evidence. If so, there may be a need to develop more precise definitions of pathogenicity to distinguish low penetrance GWAS alleles from the more highly penetrant ones associated with current genetic testing.
8 Use of GEN2PHEN deliverables.
The aim of the previous sections was to establish how clinical scientists use G2P data and how this might be affected by the development of new sequencing technologies. In the following sections, we consider how GEN2PHEN deliverables might be used by clinical scientists. Our intention was to highlight issues that might typically arise when the software is used.
8.1 Mutalyzer
The Mutalyzer software is widely used in diagnostic laboratories and is recommended for use by clinical scientists in the Unclassified Variant guidelines discussed in Section 2. The development of Mutalyzer 2.0 makes it more suitable for the analysis of large numbers of variants. Batch analysis is available, which allows rapid checking of many variants in one submission. When tested, we were able to check more than 2000 variants in approximately one minute. In addition, the development of a web service will enable Mutalyzer to be included in an automated pipeline.

Issues

The following issues arose when testing this software:

Reference sequences not found. Mutalyzer was unable to find some of the reference sequences used by diagnostic laboratories. For example, L11353.1, which is used as a reference sequence for NF2 testing, M28668.1 (CFTR) and AF395588.1 (AFDR) returned a “Gene not found” error message. It is not clear why these sequences cannot be retrieved.

Nucleotide not found error messages. Mutalyzer compares the change submitted by the user with the reference sequence and returns an error if the bases do not match. For example, U43746.1:c.7397T>C gives the error message “T not found at position 7625, found C instead”. As this example shows, the error message appears to refer to a different nucleotide (7625) from the one submitted by the user (7397). This difference arises because the user numbers variants according to their position relative to the A of the start codon, while the Mutalyzer error message numbers from the first position in the reference sequence. It might be useful if this difference could be made clear in the error message.

Checking intronic variants. Intronic variants are described by their position relative to the splice sites in an mRNA sequence. For example, U14680.1:c.212+3A>G occurs 3 bases into the intron which begins after the nucleotide corresponding to position 212 in the mRNA. However, Mutalyzer cannot check an intronic variant using a reference sequence which does not contain intronic sequences and returns an error message (Error: (Mutalyzer): Intronic position given for a non-genomic reference sequence.).

Use of LRGs. The problem of checking intronic variants could be solved by the use of LRGs to describe variants, since these reference sequences contain both the transcript and genomic sequences.
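The two numbering schemes behind the “nucleotide not found” message, and the intronic offset notation, can be illustrated with a short sketch. It is not Mutalyzer code. The function names are hypothetical, only simple substitutions at positive coding positions are handled, and the coding start of 229 is inferred from the two positions quoted above (7625 - 7397 = 228) rather than read from the GenBank record.

```python
import re

# Matches simple substitutions such as "c.7397T>C" or "c.212+3A>G".
# 5' UTR positions (e.g. "c.-5A>G") are deliberately not supported.
_SUBSTITUTION = re.compile(r"^c\.(\d+)([+-]\d+)?([ACGT])>([ACGT])$")


def parse_substitution(description):
    """Split a coding DNA substitution into
    (position, intron_offset, reference_base, new_base).
    intron_offset is 0 for exonic positions."""
    match = _SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"unsupported description: {description!r}")
    position = int(match.group(1))
    intron_offset = int(match.group(2)) if match.group(2) else 0
    return position, intron_offset, match.group(3), match.group(4)


def coding_to_reference(position, cds_start):
    """Convert a c. position (A of the start codon = 1) into a 1-based
    position counted from the first base of an mRNA reference sequence."""
    return cds_start + position - 1


# Assumed coding start for U43746.1, inferred from the report's example.
ASSUMED_CDS_START = 229

position, offset, ref_base, new_base = parse_substitution("c.7397T>C")
print(coding_to_reference(position, ASSUMED_CDS_START))

# A non-zero offset signals an intronic variant, which an mRNA-only
# reference sequence cannot be used to check.
print(parse_substitution("c.212+3A>G"))
```

An error message that reported both numbers, for example the submitted c. position alongside the position in the reference sequence, would remove the apparent discrepancy.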
8.2 WAVe (Web Analysis of the Variome).
WAVe is a locus-specific database integration tool. It integrates variants from multiple locus-specific databases and provides links to the original LSDBs. It also allows users to obtain data about the gene locus (Gene Cards, HGNC GeneNames, Entrez Gene Report), publications (QuExT searches by gene and disease (Matos et al., 2010)), disease (OMIM), pharmacogenomics (pharma knowledge base), genome (NCBI Map View and Ensembl), pathways (KEGG and Reactome), protein (SwissProt, TrEMBL, PDB, ExPASy, InterPro), GO terms, DiseaseCard and the GEN2PHEN Knowledge Centre.

The process of integrating variants is achieved by warehousing variant data into a WAVe database. Variants are gathered from LSDBs in two ways. For LSDB instances running recent versions of LOVD (2.0 and later), the variant API is used to import all variants into the WAVe database. For the remaining systems (UMD, MUTbase and other legacy applications), a web crawler is used to find variants described in the HGVS format in the webpage's HTML. During the variant gathering process, the HGVS name is checked in order to identify the type of change associated with each variant (“del”, “ins”, etc.). The addresses of LSDBs are taken from a manually curated list incorporating the LSDB listing available on the GEN2PHEN Knowledge Centre and the LOVD database index.
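The check on the HGVS name described above might look something like the following sketch. WAVe's actual rules are not documented in this report, so the function name, the set of labels and the order of the tests are all assumptions.

```python
def classify_change(hgvs_name):
    """Guess the type of change from the tokens in an HGVS description.
    An accession prefix such as "NM_007294.3:" is ignored if present."""
    description = hgvs_name.split(":", 1)[-1]
    # A description containing both "del" and "ins" (written either as
    # "delins" or as "del...ins...") is a combined change, so test first.
    if "del" in description and "ins" in description:
        return "deletion-insertion"
    for token, label in (
        ("del", "deletion"),
        ("ins", "insertion"),
        ("dup", "duplication"),
        ("inv", "inversion"),
        (">", "substitution"),
    ):
        if token in description:
            return label
    return "unknown"


for name in ("NM_007294.3:c.3119G>A", "c.68_69del", "c.76_77delinsTT"):
    print(name, "->", classify_change(name))
```

A token-based check of this kind is only as reliable as the descriptions it is given, which is one reason why variants that are not named according to HGVS nomenclature are difficult to integrate.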
Issues

For BRCA1 and BRCA2, WAVe provides links to the following LSDBs:
• The Australian variome database (empty)
• Chinese databases with a broken link (domain name does not exist). These may be the old URLs of the Zhejiang University databases.
• LSDBs, the BRCA1/2 Publication databases
• Zhejiang University databases
• UMD databases.
This is not a problem with WAVe but highlights some of the problems for curators of lists of LSDBs.
• There are many LSDBs which are empty, more so because GEN2PHEN has created large numbers of LSDBs that are awaiting curators.
• Broken links caused by databases being moved.
• Some of the LSDBs cannot be accessed because they are password protected. For example, of the 616 BRCA1 variants listed in WAVe, none appear to come from the UMD database, presumably because this database is password protected. However, because the UMD database is included in the list of BRCA1 LSDBs, it appears as though variants from this LSDB are included in the integrated list.
• Databases are excluded from the list. For example, the BIC database is not included. This might be because it is password protected or because the variants are not named in accordance with the HGVS nomenclature. However, this is unfortunate because clinical scientists consider BIC classification to be an important line of evidence when evaluating BRCA1/2 variants. Other databases may be excluded simply because they are not widely known about. For example, the BRCA1/2 classification database (2.1.2.2) does not feature in any of the curated lists.

In the GEN2PHEN 6th General Assembly Meeting there was some discussion on the need to develop a unified list of LSDB resources. The WAVe project highlights the need for such a list as well as the need for methods to ensure that the list can be kept up to date.

Update: During the process of preparing this report a list of LSDBs has been made available at the GEN2PHEN Knowledge Centre (http://www.gen2phen.org/data/lsdbs) and is temporarily mirrored at EBI (http://www.ebi.ac.uk/~pontus/lsdb_list.php.html). The mirror site at the EBI will shortly be replaced by one at the LRG website (http://www.lrg-sequence.org/page.php).

Publication searches. WAVe allows the user to identify publications linked to their gene of interest. Searches are performed using QuExT (http://bioinformatics.ua.pt/quext/) and can be “by Gene” or “by Disease”. For BRCA1, these searches identify 442 and 358 articles respectively.
• The number of publications returned is much lower than the number identified using PubMed (7684 articles).
• QuExT does not appear to allow the user to search for a specific genetic variant, e.g. BRCA1 Arg71Gly.
• QuExT does not return publications in date order. This makes it difficult for a user to identify those that are recently published.
8.3 HGVbaseG2P
HGVbaseG2P is a database that stores summary level findings from genome wide association studies (GWAS). It currently contains data from 565 such studies.
8.3.1 Searching HGVbaseG2P using a SNP.
HGVbaseG2P was searched with the SNP rs4986852. This is situated within the BRCA1 gene and corresponds to the missense substitution NM_007294.3:c.3119G>A, changing Ser to Asn at position 1040. This variant is listed as “uncertain significance” in BIC, and six of the eight entries in the LOVD publication database also describe it as uncertain (the other two state that it is non-pathogenic). Searching HGVbaseG2P with rs4986852 returns a marker, showing that this SNP has been used as a marker in GWAS.
Figure 13: Result of searching HGVbaseG2P for rs4986852.

The reference sequence coordinate for the SNP is listed as 38497955 on chromosome 17 (see Figure 13). This is different from the coordinate listed in dbSNP (41244429) and outside the current coordinates of the BRCA1 gene in the NCBI reference sequence (NC_000017.10), which are Chr17:41196312..41277500. This appears to be because the HGVbaseG2P coordinates are taken from a different assembly of chromosome 17. As shown in Figure 14, the position 38497955 was used in genome build 36.3 but not in 37.1.
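The discrepancy can be confirmed with a simple range check using only the coordinates quoted above. The function and variable names are illustrative.

```python
# BRCA1 coordinates on NC_000017.10, as quoted in this report.
BRCA1_REGION = (41196312, 41277500)


def in_region(position, region):
    """True if a 1-based position lies within an inclusive (start, end) range."""
    start, end = region
    return start <= position <= end


# Coordinates reported for rs4986852 by each resource.
positions = {"HGVbaseG2P": 38497955, "dbSNP": 41244429}
for source, position in positions.items():
    print(source, position, in_region(position, BRCA1_REGION))
```

A check like this only detects that two coordinates disagree. Converting between assemblies requires a dedicated mapping between builds, which is not attempted here.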
Figure 14: Position of rs4986852 SNP in builds of dbSNP.

Selecting “Results” returns a list of 10 studies. Because the –log p value is set to ≥ 0, all 10 studies in which this SNP was used as a marker are returned. The studies (shown in Figure 15) are by Strachan et al. (2007).
Figure 15: Part of a list of 10 studies associated with the BRCA1 SNP rs4986852.

The unadjusted P-value for rs4986852 in each of the ten studies is shown in the left column of the table. The most significant P-value is for a study investigating adult body mass index. It is not clear how a user should interpret this result. Clearly BRCA1 variants are associated with breast and ovarian cancer, not body mass index. However, this marker has not been used for any studies of these cancers. Furthermore, the design of GWAS experiments for breast/ovarian cancer would probably exclude individuals with BRCA1 and BRCA2 mutations. This is the case for the breast cancer study by Kibriya et al. (2009) that is included in HGVbaseG2P.
8.3.2 Searching HGVbaseG2P using a region.
The database was searched using the coordinates for BRCA1 (41196312..41277500). As described above, these will not be the correct coordinates for the genome build used in the current version of HGVbaseG2P, but serve as an example. The search returns 28 markers for this region. By increasing the significance threshold to – log P ≥ 3 (i.e. P value ≤ 0.001) the number of markers is reduced to 10 (see Figure 16).
Figure 16: Searching HGVbaseG2P for markers with P-values less than 0.001 using region 41196312 to 41277500 of chromosome 17.

The P-values of the 10 SNPs cannot be viewed because of the possibility that individuals could be identified. However, it is possible to deduce that the values must be between 0.001 and 0.0001, since setting the P-value threshold to –log P ≥ 4 does not return any markers.
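The deduction above follows directly from the relationship between a –log P threshold and the corresponding P-value. The sketch below assumes base-10 logarithms, which is consistent with –log P ≥ 3 corresponding to P ≤ 0.001; the function names are illustrative.

```python
import math


def threshold_to_p(neg_log_p):
    """Largest P-value that still satisfies -log10(P) >= neg_log_p."""
    return 10.0 ** (-neg_log_p)


def p_value_bracket(passed_threshold, failed_threshold):
    """Range of P-values for a marker that is returned at one -log10(P)
    threshold but not at a stricter one: (exclusive lower, inclusive upper)."""
    return threshold_to_p(failed_threshold), threshold_to_p(passed_threshold)


# Markers returned at -log P >= 3 but not at -log P >= 4.
lower, upper = p_value_bracket(3, 4)
print(f"{lower:g} < P <= {upper:g}")

# The reverse conversion, from a P-value to the threshold scale.
print(-math.log10(0.001))
```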
8.3.3 Searching HGVbaseG2P using a gene name.
The result of searching HGVbaseG2P with BRCA1 is shown in Figure 17. There are 39 markers associated with this gene, and these markers have been used in 28 studies.
Figure 17: Searching HGVbaseG2P for markers using BRCA1 text search

There is one study with a keyword match to BRCA1. This is a study of individuals with breast cancer who do not have deleterious mutations in BRCA1 or BRCA2 (Kibriya et al., 2009). None of the BRCA1 gene markers have significant P-values in this study. There are only two studies in which BRCA1 markers have a P-value less than 0.01. These studies investigate height in the British population and Parkinson’s disease (see Figure 18).
Figure 18: GWAS studies associated with markers in the BRCA1 gene.

Issues
• The issue of changing coordinates in different genome assemblies may cause problems for users. It might be useful if the genome assembly being used by HGVbaseG2P was shown on the website.
• Clinical scientists will probably be more used to interpreting odds ratios than P-values. It may be useful to provide some additional information to aid interpretation. In addition, it is not obvious whether the P-values indicate that markers are associated with increased or decreased risks.
8.4 GEN2PHEN Knowledge Centre
The knowledge centre (KC) provides a central platform for access to genotype to phenotype data with specialist knowledge. The KC will be used to disseminate information about the GEN2PHEN project, provide access to project deliverables, GEN2PHEN training tools and G2P resource lists.
8.4.1 Locating GEN2PHEN resources in the Knowledge Centre.
GEN2PHEN users can locate resources from the home page of the KC, which has a series of tabs and drop-down menus. Figure 19 shows the user selecting GEN2PHEN deliverables.
Figure 19: Selecting GEN2PHEN resources using the Knowledge Centre

Finding other GEN2PHEN resources is not so straightforward. There does not appear to be any obvious link to the GEN2PHEN training tools or the resource list. Links to these are given in the recently submitted GEN2PHEN paper (Webb et al., 2011), but it would be useful to have a clear link on the homepage. In addition, the link to the GEN2PHEN resources given in the paper (http://www.gen2phen.org/resources) appears to be broken (404 error). Using the link to the GEN2PHEN training page (http://www.gen2phen.org/training) provided in the paper allows the user to access the training tools which have been provided to date (see Figure 20).
Figure 20: Identifying GEN2PHEN training resources using the Knowledge Centre

Two of the tools, “Adding a custom log to LOVD” and “Submitting mutation data to the OI and EDS database”, are not accessible without registering as a GEN2PHEN user. It is not clear why this need be the case, and it might be better if the same policy was applied to all training materials. Also, requiring a user to register and log in might deter them from using these resources.
8.4.2 Using GEN2PHEN data via the Knowledge Centre
Selecting the GEN2PHEN data tab on the homepage allows the user to access data sets that have been generated by the project. So far this is a single set from an analysis of LSDBs produced by Mitropoulou et al (2010) generated as part of deliverable 2.3. LSDBs can be searched for using gene name or database name. This tool could be useful for clinical scientists looking to find all the available LSDBs for a gene. Searching for BRCA1 produces one hit, the BRCA1 publication database (see Figure 21). The BIC database, BRCA1 classification database and UMD BRCA1 database are not returned. The search also supplies information about the last update of the database. However, this is incorrect; the publication database was last updated in November 2010.
Figure 21: Identifying BRCA1 LSDBs using the Knowledge Centre

Searching for BRCA2 produces two hits (see Figure 22), but these are to the same Fanconi anaemia database. The first LSDB in the list only provides a link to the new version of the Fanconi anaemia database, which has been developed using LOVD.
Figure 22: Identifying BRCA2 LSDBs using the Knowledge Centre

This search did not return the UMD BRCA2 database, the BIC database or the BRCA2 publication database. The HGVS website also allows users to search for LSDBs (http://www.hgvs.org/dblist/glsdb.html). As Figure 23 shows, searching the HGVS list for BRCA1 and BRCA2 LSDBs returns more hits, but it still excludes the UMD BRCA2 database and the BRCA1 classification database, and the Zhejiang University URLs no longer work.
Figure 23: Identifying BRCA1 and BRCA2 LSDBs using the HGVS website.

The problem for both the GEN2PHEN and HGVS LSDB search tools is that they return LSDBs from either a spreadsheet or a database, and in both cases this source is incomplete and out of date. Therefore both miss databases, and the information about them (URLs and last updates) can be incorrect. The situation could be improved by merging the HGVS and GEN2PHEN lists, but the problem of returning a search from an out-of-date database will remain.
8.5 Obtaining feedback from the clinical science community.
A set of online surveys was developed for GEN2PHEN software. The aim of these surveys was to advertise software to the user community and to obtain feedback that might help guide future development. The surveys were sent to the 115 members of the ENIGMA consortium, who are investigating unclassified BRCA1 and BRCA2 variants, and to the NGRL mailing list, which is sent to more than three hundred individuals working in clinical genetics. The surveys were kept short in the hope that recipients would complete more than one survey.

Unfortunately the response to these surveys was very poor. Despite contacting several hundred individuals who might be expected to use GEN2PHEN software, the largest response we received for any piece of software was eight. This lack of response is similar to that at the 2010 BSHG conference, where a GEN2PHEN stand was set up and handouts for software were made available. It raises questions as to how we should go about publicising GEN2PHEN and obtaining feedback for software for the remainder of the project. For completeness the responses to the surveys are listed below, although clearly there was not a large enough response upon which to base any development decisions.

Question 1: Had you heard of this software prior to this survey?

| Software | Yes | No |
|---|---|---|
| HGVbaseG2P | 2 | 3 |
| Knowledge Centre | 0 | 5 |
| LOVD | 7 | 1 |
| WAVe | 0 | 3 |
Question 2: Did you find this software easy to use?

| Software | 1 (difficult) | 2 | 3 | 4 | 5 (easy) |
|---|---|---|---|---|---|
| HGVbaseG2P | 0 | 0 | 2 | 2 | 0 |
| KC | 1 | 0 | 0 | 2 | 1 |
| LOVD | 1 | 0 | 1 | 5 | 1 |
| WAVe | 1 | 0 | 0 | 4 | 0 |
Question 3: Did you find the presentation of results easy to understand?

| Software | 1 (difficult) | 2 | 3 | 4 | 5 (easy) |
|---|---|---|---|---|---|
| HGVbaseG2P | 0 | 1 | 1 | 2 | 0 |
| KC | 1 | 0 | 1 | 1 | 1 |
| LOVD | 1 | 0 | 1 | 6 | 0 |
| WAVe | 1 | 0 | 0 | 1 | 1 |

Question 4: How useful is this software for analysing clinical data?

| Software | 1 (not useful) | 2 | 3 | 4 | 5 (very useful) |
|---|---|---|---|---|---|
| HGVbaseG2P | 0 | 1 | 1 | 2 | 0 |
| KC | 1 | 1 | 1 | 1 | 0 |
| LOVD | 1 | 0 | 1 | 5 | 1 |
| WAVe | 1 | 1 | 0 | 1 | 0 |

Question 5: Any further comments about this software?

Feedback comments were only obtained for LOVD. These were:
• “Too many data in a single row”
• “Question 4 not really relevant to me as don't analyse clinical data” (this was from the individual who gave the software a score of 1).
9 Generating LRGs for BRCA1 and BRCA2
New LRGs are produced by the EBI and NCBI in response to a request from a user or consortium of users who require an LRG for reporting variants. As an example of how this works, the process of requesting LRGs for BRCA1 and BRCA2 has been documented. This includes the analysis of existing reference sequences for BRCA1 and BRCA2 and the identification of differences between them, as well as consideration of the different splicing variants.
9.1 Available BRCA1 and BRCA2 LSDBs and reference sequences.
Several LSDBs exist for both BRCA1 and BRCA2. Table 2 lists these LSDBs and the reference sequences they use to describe variants. The LSDBs were found in the HGVS list (http://www.hgvs.org/dblist/glsdb.html), in the LOVD list of databases and from our knowledge of the ENIGMA group. There may well be other datasets, not yet publicly available, that have used other reference sequences.
| Location | URL | Reference sequence | Database version |
|---|---|---|---|
| BRCA1 | | | |
| The UMD BRCA1 mutations database | http://www.umd.be/BRCA1/ | U14680.1 | |
| Chromium.liacs.nl | http://chromium.liacs.nl/LOVD2/cancer/home.php | NG_005905.1 (Note: this is a RefSeqGene; the transcript id it uses is NM_007294.3) | BRCA1 091003 |
| Zhejiang University Centre for Genetic and Genomic Medicine | http://www.chinahvp.org/LOVD/home.php?select_db=BRCA1 (this URL no longer works) | NM_007294.2 | BRCA1 100114 |
| brca.iarc.fr | http://brca.iarc.fr/LOVD/home.php?select_db=BRCA1 | U14680.1 | BRCA1 091215 |
| BIC | http://research.nhgri.nih.gov/bic/ | U14680.1 | |
| BRCA2 | | | |
| The UMD BRCA2 mutations database | http://www.umd.be/BRCA2/ | U43746 | |
| LOVD BRCA2 | http://chromium.liacs.nl/LOVD2/cancer/home.php?select_db=BRCA2 | NG_012772.1 (Note: this is a RefSeqGene; the transcript id it uses is NM_000059.3) | BRCA2 091003 |
| Zhejiang University Centre for Genetic and Genomic Medicine | http://www.chinahvp.org/LOVD/home.php?select_db=BRCA2 (this URL no longer works) | NM_000059.3 | BRCA2 100115 |
| Fanconi Anaemia Mutation Database (BRCA2) | http://chromium.liacs.nl/LOVD2/FANC/home.php?select_db=FANCD1 | NM_000059.1 | FANCD1 080908 |
| BIC | http://research.nhgri.nih.gov/bic/ | NM_000059.1 | |

Table 2: LSDBs and reference sequences for BRCA1 and BRCA2. Sequences used for clinical testing at the Regional Genetics Laboratory Services, Manchester are U14680 (BRCA1) and U43746 (BRCA2)
9.2 Comparison of Reference sequences.
Alignment of the predicted protein sequences provided by the BRCA1 reference sequences shows that there is no difference between them. For BRCA2 there are two differences:
• At position 372 the H in NM_000059.1 and U43746 is replaced by N in NM_000059.3.
• At position 599 the F in NM_000059.1 and U43746 is replaced by S in NM_000059.3.
Comparison of DNA sequences shows that there are several differences between BRCA1 reference sequences:
• According to their annotation, the first exon of NM_007294.2 is exon 1B, while for NM_007294.3 it is exon 1A.
• The U14680 sequence does not extend as far into the 5’ UTR as the other reference sequences and does not have any of the 3’ UTR. The final codon in U14680.1 is the stop codon.
• Alignment of the first exon of U14680.1 to the RefSeqGene NG_005905.1 shows that there are 4 mismatches and 1 gap. This exon does not contain any of the open reading frame.

There are also differences between the BRCA2 reference sequences. Alignment to the RefSeqGene NG_012772.1 indicates:
• Both NM_000059.1 and U43746 contain two exons, 27 (87683-88731) and 28 (89130-89190). In NM_000059.3 these are replaced by a single exon 27 (87683-89193).
• Both NM_000059.1 and U43746 contain 8 mismatches and 1 gap. Two of the mismatches cause the two amino acid mismatches discussed above. There is a further change within the ORF: at position 4791 there is a G instead of an A. The remaining differences are in the 5’ and 3’ UTR.
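Once two sequences have been aligned, listing the positions at which they differ is straightforward. The sketch below is a minimal illustration that assumes two equal-length, gap-free sequences; it is not the method used to produce the comparisons above, and the short sequences are invented toy data rather than fragments of BRCA1 or BRCA2.

```python
def differences(seq_a, seq_b):
    """List (position, residue_in_a, residue_in_b) for every mismatch
    between two equal-length sequences, with 1-based positions."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return [
        (position, a, b)
        for position, (a, b) in enumerate(zip(seq_a, seq_b), start=1)
        if a != b
    ]


# Invented toy peptides differing at two positions.
older_version = "MPHKF"
newer_version = "MPNKS"
for position, old, new in differences(older_version, newer_version):
    print(f"position {position}: {old} replaced by {new}")
```

Sequences that differ by insertions or deletions, such as the gap reported for the first exon of U14680.1, need to be aligned with a dedicated tool before a position-by-position comparison of this kind is meaningful.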
9.3 Alternative Splicing
There are no reports of alternate BRCA2 transcripts in either Ensembl or RefSeqGene and no publications were identified in PubMed. However, there is alternative splicing of BRCA1. Table 3 has been taken from a review of BRCA1 alternative splicing by Orban and Olah (2003). The paper lists a number of splice variants which have been identified. As noted by the authors there are problems with identifying alternate transcripts since many publications detail aberrant splicing by pathogenic mutations. Since the publication, several of these sequences have been suppressed by Genbank because “only partial transcript evidence exists for this transcript variant, and its full-length exon combination is unclear”. However, it could be that there are other alternate transcripts missing from Table 3. For example, Ensembl lists a total of thirty transcripts which are predicted to encode proteins.
| Name of the variant | ORF maintained? | Tissues | Comment |
|---|---|---|---|
| Full length BRCA1 | Yes | Breast, ovary, testis, thymus, various other | |
| With exon 1a (NM_007294) | Yes | Breast, ovary, testis, thymus | Still exists, transcript variant 1 |
| With exon 1b (NM_007295) | Yes | Placenta | Suppressed † |
| D(2–10) (NM_007297) | Yes | Breast, lymphocytes | Still exists, transcript variant 3 |
| D(9,10) (NM_007302) | Yes | Breast, ovary, lymphocytes | Suppressed † |
| D(9,10,11q) (NM_007305) | Yes | Breast, ovary, lymphocytes | Suppressed † |
| D(9,10,11) (NM_007298) | Yes | Breast, lymphocytes | Still exists, transcript variant 4 |
| D(11q) (NM_007304) | Yes | Breast, ovary, lymphocytes | Replaced by NM_007298. |
| D(11) (NM_007303) | Yes | Ovary, thyroid | Suppressed † |
| D(14–17) (NM_007299) | Yes | Breast, lymphocytes | Still exists, transcript variant 5 |
| D(14–18) (NM_007300) | Yes | Breast, lymphocytes | Still exists, transcript variant 2 |
| D(15–17) (NM_007301) | No | Breast, lymphocytes | Record removed. |
| -6 nt from 3’ of exon 1a (NM_007296) | Yes | Kidney, lung, other | Record removed. |

Table 3: BRCA1 alternate transcripts. This table has been adapted from Orban and Olah 2003. The comments column, which details whether a GenBank accession still exists, has been added for this report. The original table also featured other transcripts which did not have GenBank accession numbers.
† Transcripts are permanently suppressed because only partial transcript evidence exists for the transcript variant, and its full-length exon combination is unclear (from NCBI site).
9.4 Alignment to genomic reference sequence.
In order to determine the extent to which the reference sequences and alternative transcripts overlap, they were aligned to the RefSeqGene using Spidey (http://www.ncbi.nlm.nih.gov/spidey/). Predicted exon start and end positions of the transcripts are shown in Table 4. Points to note:
• Although NM_007294.2 and NM_007294.3 are supposed to begin with exons 1B and 1A respectively, this is not clear from our alignments. Both exons finish at the same position (10691). In contrast, two other transcripts that begin with exon 1B, NM_007297.3 and NM_007299.3, finish at 10685.
• There are cases (labelled red) where Spidey appears to have misaligned the exon slightly.
• There are also apparently real variations in some exon boundaries (labelled blue).
• There is an additional exon in NM_007300.3 which does not exist in other transcripts.

Table 5 shows the alignment of the three BRCA2 LSDB reference sequences to the RefSeqGene NG_012772.1. There is one major difference: NM_000059.3 does not have exon 28. Instead, exon 27 spans the region covered by exons 27 and 28 in the other reference sequences.
9.5 Annotation of transcripts.
Table 6 shows how the transcripts in the BRCA1 reference sequences and alternative transcripts have been annotated in their respective GenBank files. This has not been done for BRCA2 because two of the three GenBank files do not include annotation of exons. As the table shows, equivalent exons can have different annotation in different transcripts. This may not be important, since HGVS nomenclature does not include exon numbers. However, exon numbers are used by clinical scientists and researchers when discussing sequencing experiments and mutations, and also to catalogue variant data, e.g. in databases. Therefore there is potential for confusion when they are mapping their existing knowledge onto LRGs. Points to note are:
• The exons of U14680.1 are labelled sequentially 1-24 with exon 4 missing. This is because the original exon 4 is now considered to have been a cloning artefact (see for example Brose et al., 2004).
• In other transcripts (other than NM_007294.2) there are exons 4a (in NR_027676.1) or 4b which are equivalent to exon 5 in U14680.1 and NM_007294.2. Possibly the “a” and “b” have been used to avoid confusion with the original exon 4.
• NM_007294.2 does have an exon 4, which is equivalent to exon 3 in other transcripts. The reason for this is that NM_007294.2 does not have an exon annotated as “exon 2”. This reflects the previous annotation of other transcripts. For example, the first exons of transcript NM_007300.2 are labelled 1b, 3, 4, 5.
• Exon 11 in U14680.1 is equivalent to exons 10b in some reference sequences and 11b in NM_007294.2. There is also an exon 10a which is found in alternate transcripts NM_007298.3 and NM_007299.3.
• Exon 14 in U14680.1 is equivalent to exon 14a. There is also an exon 14b which is found in alternate transcripts NM_007298.3, NM_007299.3 and NM_007300.3.
| Exon | NM_007294.2 | NM_007294.3 | NM_007297.3 | NM_007298.3 | NM_007299.3 | NM_007300.3 | NR_027676.1 | U14680.1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 10511-10691 | 10479-10691 | 10511-10685 | | 10511-10685 | 10479-10691 | 10639-10779 | 10593-10691 |
| 2 | 11847-11945 | 11847-11945 | 11847-11945 | 11847-11945 | 11847-11945 | 11847-11945 | 11846**-11945 | 11847-11945 |
| 3 | 20183-20236 | 20183-20236 | | 20183-20236 | 20183-20236 | 20183-20236 | 20183-20236 | 20183-20236 |
| 4 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29484* | 29429-29506 |
| 5 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 |
| 6 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 |
| 7 | 36082-36186 | 36082-36186 | 36082-36186 | 36082-36186 | 36082-36186 | 36082-36186 | 36085-36186 | 36082-36186 |
| 8 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 |
| 9 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 |
| 10 | 41102-44527 | 41102-44527 | 41102-44527 | 41102-41217* | 41102-41217* | 41102-44527 | 41102-44527 | 41102-44527 |
| 11 | 44930-45018 | 44930-45018 | 44930-45018 | 44929**-45018 | 44929**-45018 | 44930-45018 | 44930-45018 | 44930-45018 |
| 12 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 |
| 13 | | | | | | 56563-56628 | | |
| 14 | 59348-59474 | 59348-59474 | 59348-59474 | 59351*-59474 | 59351*-59474 | 59351*-59474 | 59348-59474 | 59348-59474 |
| 15 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 |
| 16 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 |
| 17 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 |
| 18 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 |
| 19 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 |
| 20 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 |
| 21 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 |
| 22 | 86768-86841 | 86768-86841 | 86768-86841 | 86768-86841 | | 86768-86841 | 86768-86841 | 86768-86841 |
| 23 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 |
| 24 | 90160-91665 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-90284 |

Table 4: Exon start and end positions (start-end) generated by Spidey alignment of alternative transcripts against RefSeqGene NG_005905.1. Figures in red (with **) indicate small differences in exon positions that might be due to alignment errors. Figures in blue (with *) show real alternate start or end positions for exons. Empty cells indicate that the exon is not present in the transcript.
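The comparison behind Table 4 can be sketched in a few lines of Python. This is only an illustration, assuming the coordinates have already been extracted from the Spidey output; the dictionary layout and the function name `boundary_differences` are hypothetical and not part of any GEN2PHEN tool. Only three exons of two transcripts are included, with values copied from Table 4.

```python
# Exon (start, end) positions on NG_005905.1 for three exons, copied from Table 4.
# The table's * and ** markers are dropped so that plain integers can be compared.
# The data layout and function name are illustrative assumptions.
EXONS = {
    "NM_007294.3": {10: (41102, 44527), 11: (44930, 45018), 14: (59348, 59474)},
    "NM_007298.3": {10: (41102, 41217), 11: (44929, 45018), 14: (59351, 59474)},
}


def boundary_differences(exons, ref, alt):
    """List (exon, side, ref_pos, alt_pos) wherever a shared exon's start or end differs."""
    diffs = []
    for exon in sorted(set(exons[ref]) & set(exons[alt])):
        pairs = zip(("start", "end"), exons[ref][exon], exons[alt][exon])
        for side, ref_pos, alt_pos in pairs:
            if ref_pos != alt_pos:
                diffs.append((exon, side, ref_pos, alt_pos))
    return diffs


diffs = boundary_differences(EXONS, "NM_007294.3", "NM_007298.3")
for exon, side, ref_pos, alt_pos in diffs:
    print(f"exon {exon} {side}: {ref_pos} vs {alt_pos} (offset {alt_pos - ref_pos})")
```

A listing of this kind only reports where boundaries differ. Deciding whether a difference is a real alternate boundary or a possible alignment error, as the * and ** markers in Table 4 do, still requires manual inspection.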
| Exon | U43746.1 | NM_000059.1 | NM_000059.3 |
|---|---|---|---|
| 1 | 5000-5188 | 5000-5188 | 5001-5188 |
| 2 | 5943-6048 | 5943-6048 | 5943-6048 |
| 3 | 8598-8846 | 8598-8846 | 8598-8846 |
| 4 | 14597-14705 | 14597-14705 | 14597-14705 |
| 5 | 15622-15671 | 15622-15671 | 15622-15671 |
| 6 | 15763-15803 | 15763-15803 | 15763-15803 |
| 7 | 16020-16134 | 16020-16134 | 16020-16134 |
| 8 | 18964-19013 | 18964-19013 | 18964-19013 |
| 9 | 20440-20551 | 20440-20551 | 20440-20551 |
| 10 | 21793-22908 | 21793-22908 | 21793-22908 |
| 11 | 25786-30717 | 25786-30717 | 25786-30717 |
| 12 | 34079-34174 | 34079-34174 | 34079-34174 |
| 13 | 36348-36417 | 36348-36417 | 36348-36417 |
| 14 | 44382-44809 | 44382-44809 | 44382-44809 |
| 15 | 45949-46130 | 45949-46130 | 45949-46130 |
| 16 | 47263-47450 | 47263-47450 | 47263-47450 |
| 17 | 52044-52214 | 52044-52214 | 52044-52214 |
| 18 | 52700-53053 | 52700-53053 | 52700-53053 |
| 19 | 59922-60078 | 59922-60078 | 59922-60078 |
| 20 | 60477-60621 | 60477-60621 | 60477-60621 |
| 21 | 66191-66312 | 66191-66312 | 66191-66312 |
| 22 | 68838-69036 | 68838-69036 | 68838-69036 |
| 23 | 69271-69434 | 69271-69434 | 69271-69434 |
| 24 | 69528-69666 | 69528-69666 | 69528-69666 |
| 25 | 84210-84454 | 84210-84454 | 84210-84454 |
| 26 | 86419-86565 | 86419-86565 | 86419-86565 |
| 27 | 87683-88731 | 87683-88731 | 87683*-89193* |
| 28 | 89130-89190 | 89130-89190 | |

Table 5: Exon start and end positions (start-end) generated by Spidey alignment of alternative transcripts against RefSeqGene NM_000059.3. Figures in blue (with *) show alternate start or end positions for exons.
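| Exon number from Spidey alignment | NM_007294.2 | NM_007294.3 | NM_007297.3 | NM_007298.3 | NM_007299.3 | NM_007300.3 | NR_027676.1 | U14680.1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1b | 1a | 1b | | 1b | 1a | 1c | 1 |
| 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 4 | 3 | | 3 | 3 | 3 | 3 | 3 |
| 4 | 5 | 4b | 4b | 4b | 4b | 4b | 4a | 5 |
| 5 | 6 | 5 | 5 | 5 | 5 | 5 | 5 | 6 |
| 6 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 7 |
| 7 | 8 | 7a | 7a | 7a | 7a | 7a | 7b | 8 |
| 8 | 9 | 8 | 8 | 8 | 8 | 8 | 8 | 9 |
| 9 | 10 | 9 | 9 | 9 | 9 | 9 | 9 | 10 |
| 10 | 11b | 10b | 10b | 10a | 10a | 10b | 10b | 11 |
| 11 | 12 | 11 | 11 | 11 | 11 | 11 | 11 | 12 |
| 12 | 13 | 12 | 12 | 12 | 12 | 12 | 12 | 13 |
| 13 | | | | | | 13 | | |
| 14 | 14a | 14a | 14a | 14b | 14b | 14b | 14a | 14 |
| 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
| 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 |
| 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 |
| 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
| 22 | 22 | 22 | 22 | 22 | | 22 | 22 | 22 |
| 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 |
| 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 |

Table 6: Alternative strategies for numbering exons in BRCA1 reference sequences.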
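To illustrate how the exon labels in Table 6 relate to one another, the following Python sketch translates an exon label from one reference sequence's numbering to another's, using the Spidey exon number as the common key. It is a minimal sketch rather than an existing tool: the `LABELS` layout and the `translate_exon` function are hypothetical, and only a few exons from Table 6 are included.

```python
# Annotated exon labels keyed by Spidey exon number (subset of Table 6).
# The structure and the function below are illustrative assumptions.
LABELS = {
    "U14680.1":    {3: "3", 4: "5",  10: "11",  14: "14"},
    "NM_007294.2": {3: "4", 4: "5",  10: "11b", 14: "14a"},
    "NM_007294.3": {3: "3", 4: "4b", 10: "10b", 14: "14a"},
}


def translate_exon(label, source, target, labels=LABELS):
    """Return the target transcript's label for the exon called `label` in `source`.

    Returns None when the label is unknown or the exon has no equivalent.
    """
    for spidey_number, source_label in labels[source].items():
        if source_label == label:
            return labels[target].get(spidey_number)
    return None


# Exon 11 of U14680.1 corresponds to exon 10b of NM_007294.3.
print(translate_exon("11", "U14680.1", "NM_007294.3"))
# Exon 4 of NM_007294.2 corresponds to exon 3 of U14680.1.
print(translate_exon("4", "NM_007294.2", "U14680.1"))
```

Keying on the alignment-derived exon number rather than on the labels themselves avoids having to maintain a separate mapping for every pair of reference sequences.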
9.6 Overview of BRCA1 and BRCA2 reference sequences.
• There are LSDBs for both BRCA1 and BRCA2 that are using out of date reference sequences to describe variants.
• The same applies to sequences used for clinical testing. The use of the out of date sequences is part of the SOP used for analysis of these genes.
• Despite the fact that the annotation of NM_007294.2 states that it starts with exon 1B, this does not appear to be the case. Instead the difference between NM_007294.2 and NM_007294.3 appears to be that NM_007294.3 extends further into the 5’ UTR.
• There are minor differences between the 5’ UTR of U14680.1 and the other BRCA1 reference sequences. Possibly these could be annotated in the updateable section of the LRG.
• There are at least five alternative transcripts for BRCA1. However, these do not appear to be used for describing variants, so there may be no need to include them in the LRG. It may nevertheless be useful for researchers to know that variants that they are describing as exonic are in fact intronic in alternative transcripts. For example, if a variant caused a stop codon or frameshift but is in an exon that is not included in all transcripts, this information might be considered significant by clinical scientists. The process of assessing alternative transcripts is not necessarily straightforward, however, because the positions of exon starts and stops provided in Genbank files may not be correct.
• For BRCA2 there are 399 intronic bases between exons 27 and 28 in U43746.1 and NM_000059.1 which are exonic in NM_000059.3. This does not alter the ORF but means that there could be problems integrating variants in this region.
• Because there are no gaps within the coding regions of BRCA1 and BRCA2 alignments, the HGVS nomenclature given to the variants within the ORFs should be largely consistent between LSDBs and data from clinical labs.
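The point about variants being exonic in one transcript but intronic in another can be checked mechanically once exon coordinates are available. The Python sketch below is illustrative only: it uses a few exon positions from Table 4 (on NG_005905.1), assumes the coordinates are inclusive, and the names `is_exonic` and `exonic_status` are hypothetical.

```python
# A few exon (start, end) positions on NG_005905.1, copied from Table 4.
# Coordinates are assumed to be inclusive; names are illustrative only.
EXONS = {
    "NM_007294.3": [(53387, 53558), (59348, 59474)],
    "NM_007300.3": [(53387, 53558), (56563, 56628), (59351, 59474)],
}


def is_exonic(position, exons):
    """True if the genomic position falls inside any of the given exons."""
    return any(start <= position <= end for start, end in exons)


def exonic_status(position, transcripts=EXONS):
    """Report, per transcript, whether the position is exonic."""
    return {name: is_exonic(position, exons) for name, exons in transcripts.items()}


# Position 56600 lies in the exon found only in NM_007300.3.
print(exonic_status(56600))
```

A position such as 56600 would be reported as intronic when described against NM_007294.3 alone, which is the situation the bullet above suggests clinical scientists may wish to be made aware of.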
9.7 Survey of reference sequence users
Potential users of BRCA1 and BRCA2 LRGs were surveyed to identify which reference sequences they were using and whether they would be willing to use a LRG. The survey was sent out to the same groups as the GEN2PHEN software surveys discussed in section 8.5 of this report. In this case the response was rather better: thirty-four responses were received, despite this survey being rather longer than the software surveys. The questions and responses were as follows:

Questions 1 and 2 concerned the names and organisations of the respondents. As well as replies from UK diagnostic labs, there were responses from Italy, Denmark, Finland, Czech Republic, Netherlands, Greece, Australia, France, USA, Norway and Spain.
Question 3: Which BRCA1 reference sequence do you use?

| Reference sequence | Frequency |
|---|---|
| U14680 | 20 |
| NM_007294.2 | 6 |
| NM_007294.3 | 1 |
| Another sequence | 2 |
| Not answered | 5 |

Question 4: If you answered “Another reference sequence” please enter the id number.
“Previously to 2009 was used U14680, systematic HGVS nomenclature recommendations Ensembl: ENSG00000012048”.
“L78833”
Note, these are both genomic reference sequences.

Question 5: Which BRCA2 reference sequence do you use?

| Reference sequence | Frequency |
|---|---|
| U43746 | 13 |
| NM_000059.1 | 7 |
| NM_000059.3 | 6 |
| Another sequence | 4 |
| Not answered | 4 |

Question 6: If you answered “Another reference sequence” please enter the id number.
“OTTHUMG00000017411 V.2”
“Do not test BRCA2 in lab, but NM_000059.3 in some regions.”
“Previously to 2005 was used U43746.1, from 2005 to 2008 NM_000059.1. systematic HGVS nomenclature recommendations Ensembl: ENSG00000139618”
“AY436640”
Note, as with BRCA1 these are genomic reference sequences.

Question 7: If a LRG reference sequence was created for BRCA1 using NM_007294.3 would you migrate to using this sequence?

| Response | Frequency |
|---|---|
| Yes | 23 |
| No | 8 |
| Not answered | 3 |

Question 8: If a LRG reference sequence was created for BRCA2 using NM_000059.3 would you migrate to using this sequence?

| Response | Frequency |
|---|---|
| Yes | 23 |
| No | 9 |
| Not answered | 2 |
Question 9: If you answered no; for questions 7 or 8, please state any reasons you have for not using LRGs.
• “I've answered yes but would prefer to say probably. This would depend on there being a consensus among the community that the LRG refseq was the standard to use”
• “This would require a significant amount of work for us. We would consider if it became the industry standard.”
• “Would do if it became a requirement”
• “The reference sequences we use are quoted on every document we use relating to BRCA analysis, and all our patient reports. To change the sequence would be a document control nightmare as everything would need updating to include the new accession numbers! Also any differences between the LRG sequence and the one we use (even intronic and/or irrelevant differences) would have to be changed on all analysis reference files used in the analysis and paper templates in the lab. This would take a huge amount of time and would probably of little or no value to the service. So unless the current sequences we use were shown to be inaccurate we would not change them.”
• “I think we should migrate to HGVS numbering”
• “I would have answered don't know had that been an option, as we are not currently familiar with them – however the accommodation of alternative amino acid and exon numbering systems could be useful, particularly for genes such as NF1 (exons) and MUTYH (amino acids)”
• “My answer for questions 7 and 8 is I don't know. I hadn't heard of LRG until now. It does seem like a good idea to standardise the reference sequences being used by UK diagnostic labs though.”
• “Do not test BRCA2 in lab”
• “These are the most frequently and first used sequences for BRCA1 and BRCA2 mutations description in the genetic test reports in Italy and world-wide as well as in the international reference database (BIC). Keeping of these reference sequences will ensure consistency in calls of the same mutation amongst different genetic testing Centres.”
• “We use the reference sequence used by the genetic testing laboratory, Myriad Genetics, since all of our variants come to us with that nomenclature and we don't need to figure out the other nomenclature for them.”
• “Probably yes, but it depends on the choice of the international scientific community”
• “We use the most up to date reference sequence. Need further information as to the benefit of having an LRG reference sequence before we alter our current process”.
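The frequencies reported for Questions 3 and 7 can be expressed as proportions of the thirty-four responses. The short Python sketch below does this; the counts are those reported above, while the variable and function names are illustrative only.

```python
# Response counts as reported for Questions 3 and 7 of the survey.
brca1_usage = {
    "U14680": 20,
    "NM_007294.2": 6,
    "NM_007294.3": 1,
    "Another sequence": 2,
    "Not answered": 5,
}
migrate_brca1 = {"Yes": 23, "No": 8, "Not answered": 3}


def percentages(counts):
    """Convert raw counts to percentages of the total, rounded to one decimal place."""
    total = sum(counts.values())
    return {answer: round(100 * n / total, 1) for answer, n in counts.items()}


print(percentages(brca1_usage))
print(percentages(migrate_brca1))
```

On these figures, well over half of respondents use U14680 for BRCA1, while roughly two thirds state that they would migrate to an LRG based on NM_007294.3.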
9.8 Design of BRCA1 and BRCA2 LRGs.
The decision as to which sequences are to be included in a LRG is made by the LRG requesters. The minimum requirements for a LRG are listed at http://www.lrgsequence.org/page.php?page=contributions. It is intended that the owner of the BRCA1 and BRCA2 LRGs will be Larry Brody at the NHGRI, who is involved in the curation of the BIC database. Currently we are awaiting agreement between Larry Brody and NCBI as to the content of these LRGs. The LRGs for BRCA1 and BRCA2 will be based upon the RefSeqGene sequences NG_005905.2 (transcript id NM_007294.3) and NG_012772.1 (transcript id NM_000059.3). This has the advantages firstly that the LRG sequences will reflect our most up-to-date
knowledge of the BRCA genes; and secondly the LRG and RefSeqGenes will be in agreement. However, it also means that the reference sequences used by other LSDBs will be different from that used for the LRG. In addition, as the results of the survey show, it will require some effort for diagnostic users to modify SOPs in order to use the LRGs. It is not clear how those who have been using different reference sequences will manage the process of mapping their existing variants onto the LRG sequences. Although the analysis of BRCA1 alternate transcripts did indicate that there were different exon boundaries as well as alternate exon numberings, the alignment of the sequences used by diagnostic laboratories and by LSDBs shows that they use the same splice product. While LRGs are designed to be able to store information on alternate transcripts, this is intended for use when the alternate transcripts are “necessary and used for reporting mutations and diagnostic purposes” (from minimal information set guidelines). In the case of BRCA1, while alternate transcripts are not used for reporting mutations, they might be useful for diagnostic purposes, and so it is not clear whether or not they should be included in the LRG. There are also alternative numberings of the exons in different reference sequences which could usefully be stored in the LRG.
10 Summary
The aim of the pilot studies is to “objectively track progress and deficiencies in the GEN2PHEN project” and “provide assessment of the usefulness of the system in a ‘real-lifelike scenario’” (quoted text is from the GEN2PHEN Annex I - “Description of Work”). This report describes the ways in which clinical scientists in genetic testing laboratories work with G2P data in order to evaluate variant data. This can be summarised as follows:
• Clinical scientists need to be able to integrate lines of evidence in order to produce an interpretation for a variant.
• There is no standard method for integrating evidence but the UK Clinical Molecular Genetics Society’s guidelines for the interpretation of unclassified variants detail the evidence that should be considered. In addition, standards are being developed for LSDBs (Greenblatt et al., 2008), the description of variant pathogenicity (Plon et al., 2008) and evidence regarding splice site prediction (Spurdle et al., 2008).
• GEN2PHEN software will provide mechanisms to allow data to be shared and integrated.
• Will the data standards that are being developed be able to deal with issues of different LSDBs providing different assessments of pathogenicity? Data standards can ensure that LSDBs use the same terms to describe pathogenicity (e.g. the IARC classification terms discussed in Section 3.1). However, they may not be able to influence the quality of pathogenicity assessments, as shown by the variability of the published assessments discussed in Section 4. As well as data standards, there may be a need for data curation standards, to either remove or highlight lower-quality assessments. This may be less of an issue for diagnostic data from genetic testing laboratories, where the adherence to an SOP should ensure the quality of the assessment. However, it is also the case that, since this data is unlikely to be published in a journal, other users will be less able to make their own assessments of data quality.
• Will the development of data standards prevent the same variant identification being integrated more than once?
• Clinical scientists analyse variants according to SOPs and, in order to have GEN2PHEN software adopted by the diagnostic community, the software will have to find a place within the SOPs.
• It was our intention to use feedback from the clinical community about GEN2PHEN software but this has not been easy to obtain, either by emailed questionnaires or by face to face discussions at conferences. We need to consider how best to go about engaging with this community in future, both to get feedback about GEN2PHEN and also to publicise the software and get it adopted.
• The lack of feedback might be explained by the fact that some of the GEN2PHEN software is still in development and there is as yet no user community. Possibly we need to establish a “critical mass” of users in order to be able to generate feedback.
• However, we should also consider whether there is a mismatch between the needs of the diagnostic/clinical users and what some of the GEN2PHEN tools aim to do. Clinical scientists are clearly not the only users of G2P data and it may be that other users simply need a list of variants within a gene without supporting evidence for pathogenicity. However, it is also the case that the clinical community generates large amounts of data and that at present most of this data is not made available to the wider community. In order to encourage the sharing of this data it is important that the tools available are attractive to the community and demonstrate clear benefits.
• At present, the integration of G2P data is a manual process performed by clinical scientists on a “one variant at a time” basis. The development of NGS based clinical genetics will mean that there will need to be far more automation of this process by analysis pipelines. We need to give consideration as to how GEN2PHEN software will function within these pipelines.
10.1 Barriers to data integration and GEN2PHEN solutions
The vision of the GEN2PHEN project is shown in Figure 24, reproduced from the GEN2PHEN Description of Work document (Figure 2, page 11).
Figure 24: G2P Databases, Current and Future

As stated in the Description of Work, the problem with G2P databases is that there is “no convenient way to populate the databases, no easy way to exchange or compare or integrate the different resources, and absolutely no way to search the totality of gathered information. It is fragmented, disorganised, and highly inefficient.” GEN2PHEN will develop “a broad array of G2P databases (shown in dotted outlines), all constructed from common principles and standards via open-source software (hence all uniformly coloured white), so enabling widespread interconnectivity in the resulting ‘G2P Knowledge Network’. We will take various measures to bring the existing G2P databases into this network, especially current LSDBs.”

For the BRCA1 and BRCA2 databases discussed in this Pilot Study, there are several barriers to data integration:
• How are LSDB resources identified? As discussed in this report, maintaining an up-to-date LSDB listing is a difficult task.
• Is data suitably available for integration? Several of the BRCA databases are password protected and the curators may not wish to make the data publicly available.
• The HGVS guidelines for naming variants have not been followed by several databases and reformatting data will be necessary.
• How will the use of different reference sequences to describe variants affect integration? For example, do U14680.1:c.211A>G and NM_007294.3:c.211A>G refer to the same variant? At present, in order to be certain that they are the same, the two reference sequences need to be aligned and the region around 211A compared.
• How can data relating to the pathogenicity of variants be combined if (1) LSDBs use unrelated terms to describe pathogenicity (e.g. “1-5”, “+,- or ?”, “yes, no or uncertain”); and (2) standards for describing the probability of a variant being pathogenic are not used?
• How do we avoid data being integrated more than once? For example, there are several LSDBs for the HNPCC gene MLH1. The data from the MMR database has been included in dbSNP. It has also been added to the InSiGHT database. If InSiGHT data were also to be added to dbSNP, how would the re-inclusion of variants already in dbSNP be prevented?
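The check described above for variants named against different reference sequences (comparing the region around the variant position) can be sketched as follows. This is a deliberately simplified illustration, not a replacement for a proper alignment: it assumes both inputs are coding sequences numbered from the A of the start codon, it compares a fixed window rather than aligning, and the toy sequences are invented for the example and are not BRCA1 sequence. The function names are hypothetical.

```python
def flank(cds, pos, width=5):
    """Return (left flank, base, right flank) around 1-based coding position `pos`."""
    i = pos - 1
    return cds[max(0, i - width):i], cds[i], cds[i + 1:i + 1 + width]


def same_site(cds_a, cds_b, pos, ref_base, width=5):
    """True if both sequences carry `ref_base` at `pos` with identical flanking bases."""
    if not 1 <= pos <= min(len(cds_a), len(cds_b)):
        return False
    site_a = flank(cds_a, pos, width)
    site_b = flank(cds_b, pos, width)
    return site_a == site_b and site_a[1] == ref_base


# Toy sequences for illustration only (NOT real BRCA1 sequence).
TOY_A = "ATGGATTTATCTGCTCTTCG"
TOY_B = "ATGGATTTATCTGCTCTTCG"   # identical to TOY_A
TOY_C = "ATGGATTTTATCTGCTCTTCG"  # one extra T upstream shifts the numbering

print(same_site(TOY_A, TOY_B, 12, "T"))  # same base and context in both
print(same_site(TOY_A, TOY_C, 12, "T"))  # numbering has shifted, so the sites differ
```

With an unchanging reference such as an LRG, this comparison would only need to be made once per legacy reference sequence rather than for every variant.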
To what extent will GEN2PHEN software help to overcome these barriers?
• As mentioned in Section 8.2, GEN2PHEN is developing a unified list of LSDBs. We need to establish how complete this list is, and how easy it is to keep it up to date. If new LSDB resources are established, how will GEN2PHEN become aware of them?
• The development of LRGs will provide an unchanging reference with which to describe variants. To what extent will LRGs help with the problem of integrating sequences identified using alternate reference sequences?
• The VarioML model includes details of a variant’s pathogenicity. The VarioML website (https://svn.gene.le.ac.uk/gen2phen/trunk/data_formats/xml/html/doc.html) mentions an evidence ontology. How will the non-standard descriptions of variant pathogenicity in different databases be modelled using VarioML?
• Is the problem of variants being re-integrated into repositories one that can be solved using data identifiers, of the type being developed to identify researchers?
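One way to picture how identifiers might prevent re-integration is to key each record on its reference sequence and HGVS description and to track which sources have contributed it. The Python sketch below is an assumption-laden illustration, not a description of GEN2PHEN software: the `VariantRecord` class, the `merge_records` function, the placeholder reference "REF_X" and the variant "c.100A>G" are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class VariantRecord:
    reference: str                 # reference sequence identifier (placeholder values below)
    hgvs: str                      # HGVS description relative to that reference
    sources: set = field(default_factory=set)  # databases that contributed the record


def merge_records(records):
    """Merge records that share (reference, hgvs), combining their source lists."""
    merged = {}
    for record in records:
        key = (record.reference, record.hgvs)
        if key in merged:
            merged[key].sources |= record.sources
        else:
            merged[key] = VariantRecord(record.reference, record.hgvs, set(record.sources))
    return list(merged.values())


# Hypothetical example: the same variant arriving from two databases.
incoming = [
    VariantRecord("REF_X", "c.100A>G", {"MMR database"}),
    VariantRecord("REF_X", "c.100A>G", {"InSiGHT"}),
]
result = merge_records(incoming)
print(len(result), sorted(result[0].sources))
```

This only works when both submissions describe the variant against the same reference sequence using the same nomenclature, which is why the reference sequence and HGVS questions raised above need to be addressed first.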
11 References
Aarnio M, Sankila R, Pukkala E, Salovaara R, Aaltonen LA, de la Chapelle A, Peltomäki P, Mecklin JP, Järvinen HJ. 1999. Cancer risk in mutation carriers of DNA-mismatch-repair genes. Int. J. Cancer. 81: 214-8.
Auranen A, Song H, Waterfall C, Dicioccio RA, Kuschel B, Kjaer SK, Hogdall E, Hogdall C, Stratton J, Whittemore AS, Easton DF, Ponder BA, Novik KL, Dunning AM, Gayther S, Pharoah PD. 2005. Polymorphisms in DNA repair genes and epithelial ovarian cancer risk. Int. J. Cancer. 117: 611-8.
Bilgüvar K, Oztürk AK, Louvi A, Kwan KY, Choi M, Tatli B, Yalnizoğlu D, Tüysüz B, Cağlayan AO, Gökben S, Kaymakçalan H, Barak T, Bakircioğlu M, Yasuno K, Ho W, Sanders S, Zhu Y, Yilmaz S, Dinçer A, Johnson MH, Bronen RA, Koçer N, Per H, Mane S, Pamir MN, Yalçinkaya C, Kumandaş S, Topçu M, Ozmen M, Sestan N, Lifton RP, State MW, Günel M. 2010. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature. 467: 207-10.
Brose MS, Volpe P, Paul K, Stopfer JE, Colligon TA, Calzone KA, Weber BL. 2004. Characterization of Two Novel BRCA1 Germ-Line Mutations Involving Splice Donor Sites. Genetic Testing. 8: 133-138.
Burk-Herrick A, Scally M, Amrine-Madsen H, Stanhope MJ, Springer MS. 2006. Natural selection and mammalian BRCA1 sequences: elucidating functionally important sites relevant to breast cancer susceptibility in humans. Mamm. Genome. 17: 257-70.
Chakravarti A. 1999. Population genetics-making sense out of sequence. Nat. Genet. 21: 56-60.
Chiu RW, Akolekar R, Zheng YW, Leung TY, Sun H, Chan KC, Lun FM, Go AT, Lau ET, To WW, Leung WC, Tang RY, Au-Yeung SK, Lam H, Kung YY, Zhang X, van Vugt JM, Minekawa R, Tang MH, Wang J, Oudejans CB, Lau TK, Nicolaides KH, Lo YM. 2011. Noninvasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: large scale validity study. BMJ.
Cirulli ET, Goldstein DB. 2010. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 11: 415-25.
Costa FF. 2010. Epigenomics in cancer management. Cancer Manag. Res. 2: 255-65.
Cotton RG, Auerbach AD, Beckmann JS, Blumenfeld OO, Brookes AJ, Brown AF, Carrera P, Cox DW, Gottlieb B, Greenblatt MS, Hilbert P, Lehvaslaiho H, Liang P, Marsh S, Nebert DW, Povey S, Rossetti S, Scriver CR, Summar M, Tolan DR, Verma IC, Vihinen M, den Dunnen JT. 2008. Recommendations for locus-specific databases and their curation. Hum. Mutat. 29: 2-5.
Cox DG, Kraft P, Hankinson SE, Hunter DJ. 2005. Haplotype analysis of common variants in the BRCA1 gene and risk of sporadic breast cancer. Breast Cancer Res. 7: R171-5.
Díez O, Cortés J, Domènech M, Brunet J, Del Río E, Pericay C, Sanz J, Alonso C, Baiget M. 1999. BRCA1 mutation analysis in 83 Spanish breast and breast/ovarian cancer families. Int. J. Cancer. 83: 465-9.
Díez O, Osorio A, Durán M, Martinez-Ferrandis JI, de la Hoya M, Salazar R, Vega A, Campos B, Rodríguez-López R, Velasco E, Chaves J, Díaz-Rubio E, Jesús Cruz J, Torres M, Esteban E, Cervantes A, Alonso C, San Román JM, González-Sarmiento R, Miner C, Carracedo A, Eugenia Armengod M, Caldés T, Benítez J, Baiget M. 2003. Analysis of BRCA1 and BRCA2 genes in Spanish breast/ovarian cancer patients: a high proportion of mutations unique to Spain and evidence of founder effects. Hum. Mutat. 22: 301-12.
FitzGerald MG, Marsh DJ, Wahrer D, Bell D, Caron S, Shannon KE, Ishioka C, Isselbacher KJ, Garber JE, Eng C, Haber DA. 1998. Germline mutations in PTEN are an infrequent cause of genetic predisposition to breast cancer. Oncogene. 17: 727-31.
Fleming MA, Potter JD, Ramirez CJ, Ostrander GK, Ostrander EA. 2003. Understanding missense mutations in the BRCA1 gene: an evolutionary approach. Proc. Natl. Acad. Sci. USA. 100: 1151-6.
Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW. 2010. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods. 7: 461-5.
Goldgar DE, Easton DF, Byrnes GB, Spurdle AB, Iversen ES, Greenblatt MS; IARC Unclassified Genetic Variants Working Group. 2008. Genetic evidence and integration of various data sources for classifying uncertain variants into a single model. Hum. Mutat. 29: 1265-72.
Goldstein DB. 2009. Common genetic variation and human traits. N. Engl. J. Med. 360: 1696-8.
Gonzalez KD, Noltner KA, Buzin CH, Gu D, Wen-Fong CY, Nguyen VQ, Han JH, Lowstuter K, Longmate J, Sommer SS, Weitzel JN. 2009. Beyond Li Fraumeni Syndrome: clinical characteristics of families with p53 germline mutations. J. Clin. Oncol. 27: 1250-6.
Górski B, Narod SA, Lubiński J. 2005. A common missense variant in BRCA2 predisposes to early onset breast cancer. Breast Cancer Res. 7: R1023-R1027.
Greely HT. 2011. Get ready for the flood of fetal gene screening. Nature. 469: 289-91.
Greenblatt MS, Brody LC, Foulkes WD, Genuardi M, Hofstra RM, Olivier M, Plon SE, Sijmons RH, Sinilnikova O, Spurdle AB; IARC Unclassified Genetic Variants Working Group. 2008. Locus-specific databases and recommendations to strengthen their contribution to the classification of variants in cancer susceptibility genes. Hum. Mutat. 29: 1273-81.
Greenman J, Mohammed S, Ellis D, Watts S, Scott G, Izatt L, Barnes D, Solomon E, Hodgson S, Mathew C. 1998. Genes Chromosomes Cancer. 21: 244-9.
Guttmacher AE, McGuire AL, Ponder B, Stefánsson K. 2010. Personalized genomic information: preparing for the future of genetic medicine. Nat. Rev. Genet. 11: 161-5.
Hearle N, Schumacher V, Menko FH, Olschwang S, Boardman LA, Gille JJ, Keller JJ, Westerman AM, Scott RJ, Lim W, Trimbath JD, Giardiello FM, Gruber SB, Offerhaus GJ, de Rooij FW, Wilson JH, Hansmann A, Möslein G, Royer-Pokora B, Vogel T, Phillips RK, Spigelman AD, Houlston RS. 2006. Frequency and spectrum of cancers in the Peutz-Jeghers syndrome. Clin. Cancer Res. 12: 3209-15.
Hadjisavvas A, Charalambous E, Adamou A, Neuhausen SL, Christodoulou CG, Kyriacou K. 2004. Hereditary breast and ovarian cancer in Cyprus: identification of a founder BRCA2 mutation. Cancer Genet. Cytogenet. 151: 152-6.
Janezic SA, Ziogas A, Krumroy LM, Krasner M, Plummer SJ, Cohen P, Gildea M, Barker D, Haile R, Casey G, Anton-Culver H. 1999. Germline BRCA1 alterations in a population-based series of ovarian cancer cases. Hum. Mol. Genet. 8: 889-97.
Johnson N, Fletcher O, Palles C, Rudd M, Webb E, Sellick G, dos Santos Silva I, McCormack V, Gibson L, Fraser A, Leonard A, Gilham C, Tavtigian SV, Ashworth A, Houlston R, Peto J. 2007. Counting potentially functional variants in BRCA1, BRCA2 and ATM predicts breast cancer susceptibility. Hum. Mol. Genet. 16: 1051-7.
Kibriya MG, Jasmine F, Argos M, Andrulis IL, John EM, Chang-Claude J, Ahsan H. 2009. A pilot genome-wide association study of early-onset breast cancer. Breast Cancer Res. Treat. 114: 463-77. Lee TC, Lee AS, Li KB. 2008. Incorporating the amino acid properties to predict the significance of missense mutations. Amino Acids. 35: 615-26. McKean-Cowdin R, Spencer Feigelson H, Xia LY, Pearce CL, Thomas DC, Stram DO, Henderson BE. 2005. BRCA1 variants in a family study of African-American and Latina women. Hum. Genet. 116: 497-506. Matos S, Arrais JP, Maia-Rodrigues J, Oliveira JL. 2010. Concept-based query expansion for retrieving gene related publications from MEDLINE. BMC Bioinformatics. 11: 212. Menzel HJ, Sarmanova J, Soucek P, Berberich R, Grünewald K, Haun M, Kraft HG. 2004 Association of NQO1 polymorphism with spontaneous breast cancer in two independent populations. Br. J. Cancer. 90: 1989-94. Miki Y, Swensen J, Shattuck-Eidens D, Futreal PA, Harshman K, Tavtigian S, Liu Q, Cochran C, Bennett LM, Ding W, et al. 1994 A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science. 266: 66-71. Mitropoulou C, Webb AJ, Mitropoulos K, Brookes AJ, Patrinos GP. 2010. Locus-specific database domain and data content analysis: evolution and content maturation toward clinical use. Hum. Mutat. 31: 1109-16. Morris JR, Pangon L, Boutell C, Katagiri T, Keep NH, Solomon E. 2006. Genetic analysis of BRCA1 ubiquitin ligase activity and its relationship to breast cancer susceptibility. Hum. Mol. Genet. 15: 599-606. Need AC, Goldstein DB. 2010. Whole genome association studies in complex diseases: where do we stand? Dialogues Clin Neurosci. 12: 37-46. Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. 2010. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42: 30-5. Orban TI, Olah E. 2003. Emerging roles of BRCA1 alternative splicing. Mol. Pathol. 56: 191-7. 
Pettigrew C, Wayte N, Lovelock PK, Tavtigian SV, Chenevix-Trench G, Spurdle AB, Brown MA. 2005. Evolutionary conservation analysis increases the colocalization of predicted exonic splicing enhancers in the BRCA1 gene with missense sequence changes and in-frame deletions, but not polymorphisms. Breast Cancer Res. 7: R929-39.
Plon SE, Eccles DM, Easton D, Foulkes WD, Genuardi M, Greenblatt MS, Hogervorst FB, Hoogerbrugge N, Spurdle AB, Tavtigian SV; IARC Unclassified Genetic Variants Working Group. 2008. Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum. Mutat. 29: 1282-91.
Rahman N, Seal S, Thompson D, Kelly P, Renwick A, Elliott A, Reid S, Spanova K, Barfoot R, Chagtai T, Jayatilake H, McGuffog L, Hanks S, Evans DG, Eccles D; Breast Cancer Susceptibility Collaboration (UK), Easton DF, Stratton MR. 2007. Nat. Genet. 39: 165-7.
Read A and Donnai D. 2010. New Clinical Genetics, 2nd Edition. Scion Publishing Ltd.
Reich DE, Lander ES. 2001. On the allelic spectrum of human disease. Trends Genet. 17: 502-10.
Ruffner H, Joazeiro CA, Hemmati D, Hunter T, Verma IM. 2001. Cancer-predisposing mutations within the RING domain of BRCA1: loss of ubiquitin protein ligase activity and protection from radiation hypersensitivity. Proc. Natl. Acad. Sci. U S A. 98: 5134-9.
Santos C, Peixoto A, Rocha P, Vega A, Soares MJ, Cerveira N, Bizarro S, Pinheiro M, Pereira D, Rodrigues H, Castro F, Henrique R, Teixeira MR. 2009. Haplotype and quantitative transcript analyses of Portuguese breast/ovarian cancer families with the BRCA1 R71G founder mutation of Galician origin. Fam. Cancer. 8: 203-8.
Schoumacher F, Glaus A, Mueller H, Eppenberger U, Bolliger B, Senn HJ. 2001. BRCA1/2 mutations in Swiss patients with familial or early-onset breast and ovarian cancer. Swiss Med. Wkly. 131: 223-6.
Schrader KA, Masciari S, Boyd N, Wiyrick S, Kaurah P, Senz J, Burke W, Lynch HT, Garber JE, Huntsman DG. 2008. Hereditary diffuse gastric cancer: association with lobular breast cancer. Fam. Cancer. 7: 73-82.
Seal S, Thompson D, Renwick A, Elliott A, Kelly P, Barfoot R, Chagtai T, Jayatilake H, Ahmed M, Spanova K, North B, McGuffog L, Evans DG, Eccles D; Breast Cancer Susceptibility Collaboration (UK), Easton DF, Stratton MR, Rahman N. 2006. Truncating mutations in the Fanconi anemia J gene BRIP1 are low-penetrance breast cancer susceptibility alleles. Nat. Genet. 38: 1239-41.
Seymour IJ, Casadei S, Zampiga V, Rosato S, Danesi R, Falcini F, Strada M, Morini N, Naldoni C, Paradiso A, Tommasi S, Schittulli F, Amadori D, Calistri D. 2008. Disease family history and modification of breast cancer risk in common BRCA2 variants. Oncol. Rep. 19: 783-6.
Spurdle AB, Couch FJ, Hogervorst FB, Radice P, Sinilnikova OM; IARC Unclassified Genetic Variants Working Group. 2008. Prediction and assessment of splicing alterations: implications for clinical testing. Hum. Mutat. 29: 1304-13.
Strachan DP, Rudnicka AR, Power C, Shepherd P, Fuller E, Davis A, Gibb I, Kumari M, Rumley A, Macfarlane GJ, Rahi J, Rodgers B, Stansfeld S. 2007. Int. J. Epidemiol. 36: 522-31.
Szabo C, Masiello A, Ryan JF, Brody LC. 2000. The breast cancer information core: database design, structure, and scope. Hum. Mutat. 16: 123-31.
Tavtigian SV, Deffenbaugh AM, Yin L, Judkins T, Scholl T, Samollow PB, de Silva D, Zharkikh A, Thomas A. 2006. Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. J. Med. Genet. 43: 295-305.
Tavtigian SV, Byrnes GB, Goldgar DE, Thomas A. 2008. Classification of rare missense substitutions, using risk surfaces, with genetic- and molecular-epidemiology applications. Hum. Mutat. 29: 1342-54.
Tommasi S, Pilato B, Pinto R, Monaco A, Bruno M, Campana M, Digennaro M, Schittulli F, Lacalamita R, Paradiso A. 2008. Molecular and in silico analysis of BRCA1 and BRCA2 variants. Mutat. Res. 644: 64-70.
Vega A, Campos B, Bressac-De-Paillerets B, Bond PM, Janin N, Douglas FS, Domènech M, Baena M, Pericay C, Alonso C, Carracedo A, Baiget M, Diez O. 2001. The R71G BRCA1 is a founder Spanish mutation and leads to aberrant splicing of the transcript. Hum. Mutat. 17: 520-1.
Vissers LE, de Ligt J, Gilissen C, Janssen I, Steehouwer M, de Vries P, van Lier B, Arts P, Wieskamp N, Del Rosario M, van Bon BW, Hoischen A, de Vries BB, Brunner HG, Veltman JA. 2010. A de novo paradigm for mental retardation. Nat. Genet. 42: 1109-12.
Walsh T, Lee MK, Casadei S, Thornton AM, Stray SM, Pennil C, Nord AS, Mandell JB, Swisher EM, King MC. 2010. Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proc. Natl. Acad. Sci. U S A. 107: 12629-33.
Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JR, Stevens S, Hall AS, Samani NJ, Shields B, Prokopenko I, Farrall M, Dominiczak A; Diabetes Genetics Initiative; Wellcome Trust Case Control Consortium, Johnson T, Bergmann S, Beckmann JS, Vollenweider P, Waterworth DM, Mooser V, Palmer CN, Morris AD, Ouwehand WH; Cambridge GEM Consortium, Zhao JH, Li S, Loos RJ, Barroso I, Deloukas P, Sandhu MS, Wheeler E, Soranzo N, Inouye M, Wareham NJ, Caulfield M, Munroe PB, Hattersley AT, McCarthy MI, Frayling TM. 2008. Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 40: 575-83.
Wenham RM, Schildkraut JM, McLean K, Calingaert B, Bentley RC, Marks J, Berchuck A. 2003. Polymorphisms in BRCA1 and BRCA2 and risk of epithelial ovarian cancer. Clin. Cancer Res. 9: 4396-403.
Document Information
| Grant Agreement Number | HEALTH-F4-2007-200754 |
| Acronym | GEN2PHEN |
| Full title | Genotype-To-Phenotype Databases: A Holistic Solution |
| Project URL | http://www.gen2phen.org |
| EU Project officer | Dr. Iiro Eerola (Iiro.EEROLA@ec.europa.eu) |
| Deliverable | Number: 1.5; Title: Intermediate Report from Project Assessment Pilot |
| Work package | Number: 1; Title: Scientific Coordination |
| Delivery date | Contractual: Month 30; Actual: 23/02/2011 |
| Status | Version 1.2, Final |
| Nature | Report |
| Dissemination Level | Public |
| Authors (Partner) | UNIMAN |
| Responsible Author | Michael Cornell; Partner: UNIMAN; Email: michael.cornell@cmft.nhs.uk; Phone: 0044 (0)161 276 8716 |
Document History
| Name | Date | Version | Description |
|---|---|---|---|
| Michael Cornell | 07/02/2011 | 1.0 | First Draft |
| Michael Cornell | 20/02/2011 | 1.1 | Amended following comments from reviewers |
| Michael Cornell | 23/02/11 | 1.2 | Amended following comments from consortium |
Definitions
Partners of the GEN2PHEN Consortium are referred to herein according to the following codes:
ULEIC – University of Leicester (UK) – Coordinator
EMBL – European Molecular Biology Laboratory (Germany) – Beneficiary
FIMIM – Fundació IMIM (Spain) – Beneficiary
LUMC – Leiden University Medical Centre (Netherlands) – Beneficiary
INSERM – Institut National de la Santé et de la Recherche Médicale (France) – Beneficiary
KI – Karolinska Institutet (Sweden) – Beneficiary
FORTH – Foundation for Research and Technology Hellas (Greece) – Beneficiary
CEA – Commissariat à l’Energie Atomique (France) – Beneficiary
EMC – Erasmus Universitair Medisch Centrum Rotterdam (Netherlands) – Beneficiary
UH.FGC – Helsingin Yliopisto (Finland) – Beneficiary
UAVR – Universidade de Aveiro (Portugal) – Beneficiary
UWC – University of the Western Cape (South Africa) – Beneficiary
CSIR – Council of Scientific and Industrial Research (India) – Beneficiary
SIB – Swiss Institute of Bioinformatics (Switzerland) – Beneficiary
UNIMAN – The University of Manchester (UK) – Beneficiary
BIOBASE – BioBase GmbH. (Germany) – Beneficiary
deCODE – Islensk Erfoagreining EH (Iceland) – Beneficiary
PHENO – Phenosystems S.A. (Belgium) – Beneficiary
BCP – Biocomputing Platforms Ltd. Oy (Finland) – Beneficiary
UPAT – University of Patras (Greece) – Beneficiary

Grant Agreement: The agreement signed between the beneficiaries and the European Commission for the undertaking of the GEN2PHEN project (HEALTH-200754).
Project: The sum of all activities carried out in the framework of the Grant Agreement by the Consortium.
Work plan: Schedule of tasks, deliverables, efforts, dates and responsibilities corresponding to the work to be carried out for the GEN2PHEN project, as specified in Annex I to the Grant Agreement.
Consortium: The GEN2PHEN Consortium, formed by the above-mentioned legal entities.
Consortium agreement: Agreement concluded amongst GEN2PHEN participants for the implementation of the Grant Agreement. Such an agreement shall not affect the parties’ obligations to the Community and/or to one another arising from the Grant Agreement.
1 Introduction
The aim of the second Pilot Study is to assess GEN2PHEN software from the perspective of an external user. The study focuses particularly on diagnostic laboratory data users, who have a particular interest in determining the pathogenicity of sequence changes. In addition, we consider research data users who are developing tests to identify disease-causing genetic variations, and members of the ENIGMA consortium (http://enigmaconsortium.org/), a worldwide group that is accumulating evidence about variants of uncertain significance with the aim of classifying their involvement in predisposition to breast and ovarian cancer. In the main, this pilot focuses on variants in the BRCA1 and BRCA2 genes, which are associated with breast and ovarian cancer. The ways in which these potential GEN2PHEN users generate and analyse variant data are discussed, and the potential impact of GEN2PHEN outputs on these processes is considered.

When considering the potential impact of software it is important to note that the ways in which genetic testing is carried out could change substantially in the near future. The incorporation of next generation sequencing (NGS) technologies into diagnostic sequencing is only just beginning to be piloted. However, it is clear that it will be possible to sequence more genes, much faster than traditional Sanger sequencing techniques allow. This, combined with the falling cost of NGS, will mean that far more clinical sequence data will be generated. We therefore need to ensure that the technology developed by GEN2PHEN will be able to cope with next generation diagnostic sequencing.
2 Analysis of variants by clinical scientists
The role of the clinical scientist in a diagnostic laboratory is to perform genetic tests, such as DNA sequencing or MLPA (multiplex ligation-dependent probe amplification), and to analyse any variants identified by the test. The tests performed by clinical scientists can be diagnostic or predictive. In diagnostic testing, the individual tested has a phenotype, such as breast/ovarian cancer, and the test is carried out to try to determine any underlying genetic cause. In the case of BRCA1 and BRCA2 sequencing, for example, family members may subsequently be tested to determine whether they have the same genotype. In contrast, predictive testing might determine the likelihood of an individual going on to develop a phenotype. Predictive testing is used for pre-implantation genetic diagnosis (PGD) of chromosomal abnormalities such as trisomy 21. The recent development of tests based on next generation sequencing of fetal DNA in maternal blood (Chiu et al., 2011) may lead to many more PGD tests being developed (Greely, 2011).

The process of analysing variants involves assessing multiple lines of evidence in order to produce an overall assessment of each variant’s pathogenicity. The guidelines developed by the UK Clinical Molecular Genetics Society (CMGS) and the Dutch Society of Clinical Genetic Laboratory Specialists for the interpretation and reporting of unclassified variants (the UV guidelines, see http://www.cmgs.org/BPGs/pdfs%20current%20bpgs/UV%20GUIDELINES%20ratified.pdf ) list the following lines of evidence that might be used to assess pathogenicity:

2.1 Locus-specific databases. According to the UV guidelines, LSDBs should contain accurate (curated), clearly referenced data naming variants at the DNA, RNA and
protein level and include all relevant comments relating to the clinical interpretation of the variant. It is considered ideal if LSDBs allow for repeated submissions of the same variant, from different individuals, rather than recording each variant only once.
2.2 Testing matched controls. This involves comparing the frequency of a variant in affected individuals with its frequency in a healthy control group. For example, Górski et al (2005) compared the frequency of the BRCA2 C5972T variant in 3,241 cases of breast cancer diagnosed at under 51 years of age with that in 2,791 ethnically matched controls. The authors state that this variant predisposes individuals to breast cancer. Comparing different histologic subgroups, they found that the effect was most pronounced in women who had ductal carcinoma in situ (DCIS) with micro-invasion (odds ratio = 2.8; p<0.0001). From discussions with clinical scientists it appears that although the authors describe C5972T as pathogenic, this piece of evidence would be treated “with caution” by a diagnostic laboratory. The odds ratio of 2.8 is fairly low and the DCIS phenotype is considered fairly benign. The UV guidelines stress the importance of considering patient ethnicity when using matched controls. The occurrence of many SNPs appears to vary across populations according to country of origin (for example, see Sven Bergmann’s analysis of the CoLaus cohort presented at ESHG 2010, https://secure.medacad.org/eshg.org/fileadmin/www.eshg.org/abstracts/ESHG2010Abstracts.pdf).

2.3 Co-occurrence in trans with known deleterious mutations. This may be useful in the analysis of BRCA1 variants. There is evidence that homozygotes and compound heterozygotes of BRCA1 pathogenic mutations are embryonically lethal. Therefore, an unclassified variant which is in trans with a known pathogenic mutation in a patient may be classified as non-pathogenic. Determining whether a variant occurs in trans with a pathogenic mutation may require sequencing of DNA from parents.

2.4 Co-segregation with the disease in the family. This approach is most useful as a means of excluding pathogenicity in cases where a variant does not segregate with a given disorder. However, the penetrance of the variant could be an issue. Also, an unclassified variant may only appear to be pathogenic because it is in cis with an unidentified pathogenic mutation.

2.5 Occurrence of a new variant concurrent with the (sporadic) incidence of the disease. The de novo occurrence of a variant in a strong candidate disease gene concurrent with the sporadic incidence of the disease could be considered as strong evidence of pathogenicity. There are other factors which need to be considered: does the mutation affect mRNA splicing or amino acid sequence? Is it possible that what appears to be a de novo deletion may in fact be derived from a parent who is heterozygous for the deletion with the second chromosome carrying a duplication (e.g. spinal muscular atrophy)? Might non-paternity be an issue?
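The matched-control comparison described in section 2.2 is usually summarised as an odds ratio. The following is a minimal Python sketch of that calculation, not a method taken from this report or from Górski et al: the function name `odds_ratio`, the Wald-type 95% interval and the carrier counts are all illustrative assumptions.

```python
import math


def odds_ratio(case_carriers, case_noncarriers, control_carriers, control_noncarriers):
    """Return (odds ratio, lower bound, upper bound) for a 2x2 case-control table.

    The bounds are a Wald-type 95% confidence interval on the log odds ratio.
    """
    cells = (case_carriers, case_noncarriers, control_carriers, control_noncarriers)
    if min(cells) <= 0:
        raise ValueError("all four cell counts must be positive")
    a, b, c, d = cells
    estimate = (a * d) / (b * c)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = 1.96  # normal quantile for a two-sided 95% interval
    lower = math.exp(math.log(estimate) - z * se_log)
    upper = math.exp(math.log(estimate) + z * se_log)
    return estimate, lower, upper


# Hypothetical carrier counts, NOT the data reported by Górski et al (2005).
estimate, lower, upper = odds_ratio(30, 970, 10, 990)
print(f"OR = {estimate:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

An interval whose lower bound sits close to 1 is one reason a laboratory might treat a modest odds ratio "with caution", although the interpretation also depends on the phenotype and on how well the controls are matched.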
2.6 In silico predictions. Diagnostic laboratories use software tools to determine the likely functional consequences of missense mutations and identify changes to mRNA
splicing patterns. Predictions based upon these software tools are considered to be acceptable in order to gain insight into the possible consequences of sequence variations but it is unacceptable to assign pathogenicity based solely on these results.
2.7 RNA Studies. RNA studies are regarded as the best means for gaining a definitive interpretation of putative splicing mutations. However, there are several limitations: the studies can be time consuming, not all laboratories have the facilities to perform these analyses, and limited expression patterns may mean that the required tissue is not available for analysis.

2.8 Functional Studies. Protein functional studies provide a useful means of assessing the consequences of amino acid substitutions. However, such tests are only available for a small subset of genes. In addition, proteins may contain multiple functional domains and have multiple functions. Therefore, it should be remembered that if a test indicates that an amino acid substitution does not affect protein function, this does not exclude the possibility that the same substitution might affect other functions of the protein.

2.9 Loss of Heterozygosity. Loss of heterozygosity (LOH) occurs in an individual with a germline mutation when the remaining functional allele in a somatic cell becomes inactivated by mutation. For example, in hereditary retinoblastoma, a child inherits from one parent a copy of the RB1 gene that carries a pathogenic change. Most cells will have a functional second copy, but chance loss of heterozygosity events (somatic mutations) in individual cells lead to development of retinal cancer. Therefore, hereditary retinoblastoma indicates the presence of a germline mutation. The UV guidelines state that it is acceptable to use LOH to assist in the prediction of pathogenicity of variants in tumour suppressor genes; however, this evidence is unlikely to be convincing in the absence of other lines of evidence.

2.10 Presence or absence in SNP Databases. If a variant is present in an unaffected individual this may be taken as evidence that it is a benign polymorphism rather than a pathogenic variant. Databases such as dbSNP store the normal variants in a gene and can be used to help decide whether a variant is benign. However, dbSNP entries often lack frequency data and in some instances the contents of LSDBs have been transferred to dbSNP. Therefore, although it is considered essential that dbSNP is searched, the presence of a variant in dbSNP should not be used as sole evidence that it is non-pathogenic in the absence of convincing frequency information. In the future, data from the 1000 Genomes Project will also be incorporated into this type of analysis.

2.11 Integrating lines of evidence. Bayesian methods for integrating lines of evidence to assess the likelihood of variant pathogenicity have been proposed (e.g. Goldgar et al., 2008). However, such statistical tools are not yet available to diagnostic labs. Instead, clinical scientists rank the credibility of different lines of evidence. The most credible lines of evidence are from journal publications; the least credible from bioinformatic analysis (see Figure 1).
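To illustrate what a Bayesian integration of evidence involves, the sketch below combines a prior probability of pathogenicity with one likelihood ratio per line of evidence. It is a simplified illustration rather than the published model: it assumes the lines of evidence are independent, and the function name `posterior_probability`, the prior and the likelihood ratios are all hypothetical values chosen for the example.

```python
from math import prod


def posterior_probability(prior, likelihood_ratios):
    """Combine a prior probability with independent likelihood ratios.

    Each likelihood ratio is P(evidence | pathogenic) / P(evidence | neutral).
    Independence between lines of evidence is an assumption of this sketch.
    """
    if not 0 < prior < 1:
        raise ValueError("prior must lie strictly between 0 and 1")
    if any(lr <= 0 for lr in likelihood_ratios):
        raise ValueError("likelihood ratios must be positive")
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * prod(likelihood_ratios)
    return posterior_odds / (1 + posterior_odds)


# Hypothetical inputs: a prior of 0.1 and three lines of evidence
# (for example co-segregation, co-occurrence in trans, family history).
posterior = posterior_probability(0.1, [20.0, 5.0, 0.5])
print(f"posterior probability of pathogenicity = {posterior:.3f}")
```

A likelihood ratio below 1 (here 0.5) lowers the posterior, so evidence against pathogenicity is handled in the same way as evidence for it.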
Figure 1. Strategy for assessing lines of G2P evidence of variant pathogenicity. This figure has been taken from a training course for clinical scientists.
2.12 Classifying variants. After integrating all available lines of evidence the clinical scientist will classify a variant. Classification strategies may not be the same in all testing laboratories, but a typical set of classes might be:
Class 1 – Certainly not pathogenic
Class 2 – Unlikely to be pathogenic but cannot be formally proven
Class 3 – Unable to alter classification
Class 4 – Likely to be pathogenic but cannot be formally proven
Class 5 – Certainly pathogenic
2.13 Reporting variants. The UV guidelines state that it is essential that laboratories have mechanisms in place to submit results to existing databases (especially LSDBs). In addition, it is essential that laboratories issue an updated clinical report as new information becomes available to them (i.e. reports should be re-issued when a UV is reclassified as clearly pathogenic or as not pathogenic).
3 Bioinformatics resources for clinical scientists
As discussed in section 2, clinical scientists make use of databases and sequence analysis tools to determine variant pathogenicity. The following databases are available for the analysis of BRCA1 and BRCA2 variants.
3.1 Locus specific databases.
LSDBs have been defined as ‘‘a collection of sequence variants in a specific gene that causes a Mendelian disorder or change in phenotype’’ (Cotton et al., 2008). There is an underlying assumption that an LSDB is curated by an expert(s) in that gene, and that this expertise is
reflected in the curation of variant data and distinguishes LSDBs from other variant databases. LSDBs may be used by clinical scientists to help evaluate variant pathogenicity but this may not be their only use. Greenblatt et al (2008) list a further five possible uses, including providing a catalogue of variants in a gene, providing details of DNA variation for in vitro assays and detailing ethnic and geographic variation of genetic variants. Because of this, different databases may classify the same variant differently and conclusions may or may not be supported by sufficient reliable data (Greenblatt et al, 2008).

There are currently no official standards describing the contents of an LSDB. However, recommendations have been made that would make LSDBs more useful for the classification of variants. Greenblatt et al (2008) made the following five recommendations:

1. LSDBs should only report a conclusion related to pathogenicity if a consensus has been reached by an expert panel. The panel should represent different areas of expertise (clinical, diagnostic, molecular, and computational).
2. The system used to classify variants should be standardized, using the five class IARC (International Agency for Research on Cancer) system developed by Plon et al (2008):

| Class | Pathogenicity | Posterior Probability |
|---|---|---|
| 5 | Pathogenic | >99% |
| 4 | Likely pathogenic | 95-99% |
| 3 | Uncertain | 5-95% |
| 2 | Likely neutral | 0.1-5% |
| 1 | Neutral, no clinical significance | <0.1% |

3. Evidence that supports a conclusion should be reported in the database, including sources and criteria used for assignment.
4. Variants should only be classified as pathogenic if more than one type of evidence has been considered.
5. All instances of all variants should be recorded.
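The posterior probability bands of the five class IARC system listed above can be expressed as a simple lookup. This is a minimal Python sketch: the function name `iarc_class` is hypothetical, and the treatment of values that fall exactly on a band boundary (for example exactly 95%) is an assumption, since the bands as quoted here do not say which class such a value belongs to.

```python
def iarc_class(posterior):
    """Map a posterior probability of pathogenicity (0 to 1) to an IARC class.

    Boundary handling is an assumption of this sketch: a value exactly on a
    lower bound (0.95, 0.05, 0.001) is placed in the higher class.
    """
    if not 0 <= posterior <= 1:
        raise ValueError("posterior must lie between 0 and 1")
    if posterior > 0.99:
        return 5  # Pathogenic
    if posterior >= 0.95:
        return 4  # Likely pathogenic
    if posterior >= 0.05:
        return 3  # Uncertain
    if posterior >= 0.001:
        return 2  # Likely neutral
    return 1  # Neutral, no clinical significance


# Illustrative probabilities only, not values from any database.
for p in (0.995, 0.97, 0.5, 0.01, 0.0005):
    print(p, "-> class", iarc_class(p))
```

Keeping the thresholds in one function makes it easier to see that class 3 spans by far the widest range of probabilities, which is why many variants remain unclassified until further evidence accumulates.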
3.2 Databases for the analysis of BRCA1 and BRCA2 variants
The following locus specific databases are available for BRCA1 and BRCA2 genes.
3.2.1 UMD databases. UMD databases exist for both BRCA1 (http://www.umd.be/BRCA1/) and BRCA2 (http://www.umd.be/BRCA2/) variants. These are private, password-protected databases. They store variants from a network of 16 French diagnostic laboratories. The BRCA1 database currently contains 4,222 entries for 1,143 different mutations, while the BRCA2 database contains 4,972 entries for 1,513 mutations.

3.2.2 LOVD databases. There are several LOVD databases available.

BRCA1/2 Publication databases (http://chromium.liacs.nl/LOVD2/cancer/home.php). These databases store information on BRCA1 and BRCA2 taken from journal articles. The BRCA1 database currently contains 1,465 entries on 502 unique variants, which are taken from 125 publications. The BRCA2 database contains 934 entries on 487 unique variants
which are taken from 60 publications. Both databases are fairly up to date, having been last updated in November 2010. The database content is shown in Figure 2.
Figure 2: Example of variant data in BRCA2 publication database.
LOVD 2.0 allows two descriptions of variant pathogenicity, one from the submitter and one from the curator. The database does not use the 1-5 description of pathogenicity. Instead it uses the following:

| Symbol | Meaning |
|---|---|
| - | No known pathogenicity |
| -? | Probably no pathogenicity |
| ? | Unknown |
| +? | Probably pathogenic |
| + | Pathogenic |

However, in these LOVD databases (and in many others) only one classification is given. The second description is always left as “?”.
BRCA1 classification databases (http://brca.iarc.fr/LOVD/home.php?select_db=BRCA1). These databases store classifications of previously unclassified BRCA1 variants obtained using a bioinformatics-based approach (Tavtigian et al., 2008). Currently there are 112 such classifications in the database. As shown in Figure 3, this database uses the same description of pathogenicity, but also stores the IARC classification in another column. Note that the two are not in agreement: c.65T>C is described as both “?/?” (i.e. it is considered to be unknown by both the submitter and curator) and “5-Definitely pathogenic”.
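The paired LOVD 2.0 descriptions discussed above, such as “?/?”, can be decoded with a short Python sketch. The function name `parse_lovd_pathogenicity` is hypothetical. The sketch assumes that the value before the slash is the submitter's description and the value after it is the curator's. It also assumes the five symbols are "-", "-?", "?", "+?" and "+".

```python
from typing import Tuple

# Symbol meanings as listed for LOVD 2.0 in the text above.
LOVD_CODES = {
    "-": "No known pathogenicity",
    "-?": "Probably no pathogenicity",
    "?": "Unknown",
    "+?": "Probably pathogenic",
    "+": "Pathogenic",
}


def parse_lovd_pathogenicity(value: str) -> Tuple[str, str]:
    """Split a paired code such as '+/?' into two plain-text descriptions.

    Assumption: first code = submitter, second code = curator.
    """
    first, separator, second = value.strip().partition("/")
    if not separator:
        raise ValueError(f"expected two codes separated by '/': {value!r}")
    try:
        return LOVD_CODES[first], LOVD_CODES[second]
    except KeyError as error:
        raise ValueError(f"unknown pathogenicity code in {value!r}") from error
```

Under these assumptions `parse_lovd_pathogenicity("+/?")` returns `("Pathogenic", "Unknown")`. The decoded value alone cannot show whether the second “?” is a considered opinion or simply a field left empty.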
Figure 3: Example of variant data in BRCA1 classification database.
Zhejiang University databases (http://www.chinahvp.org/LOVD/home.php?select_db=BRCA1). These databases do not currently appear to be working, or the URL has changed.
Australian Human Variome Databases (https://australianhumanvariomedatabase.arcs.org.au/). This site lists LOVDs for several cancer-related genes, including BRCA1 and BRCA2, but at present none contain any variants.
Fanconi Anaemia Mutation Database (http://chromium.liacs.nl/LOVD2/FANC/home.php?select_db=FANCD1). This database contains data for genes associated with Fanconi anaemia (FA). There are at least 13 genes associated with FA, including FANCD1 (BRCA2). The database contains 57 entries on 35 variants. All the variants listed are described as pathogenic. Some data from the BIC database (see 3.2.3) have been reproduced in this database (see Figure 4).
Figure 4: Example of BRCA2 variants in the Fanconi Anaemia Mutation Database.
3.2.3 Breast Cancer Information Core (BIC) database. (http://research.nhgri.nih.gov/bic). BIC (Szabo et al., 2000) is a password-protected database that stores variant data on the BRCA1 and BRCA2 genes. It differs from the other LSDBs in that the pathogenicity of variants is decided by a database committee rather than by the data submitters. Therefore, although the database contains multiple instances of the same variant identified in multiple individuals, the pathogenicity of each instance will be the same. This policy has been in effect since 2006. Before then, the effect of a variant was reported by the submitter. The change in policy reflects the concern that some BIC entries contained inappropriate conclusions. BIC now classifies the clinical importance of variants as “yes”, “no” or “unknown”.
Unlike the LOVD publication database, BIC does contain unpublished data, much of it from clinical tests. The majority of these data (8826 of 12016 BRCA1 entries and 9891 of 11331 BRCA2 entries) have been supplied by Myriad (http://www.myriadtests.com). BIC does not provide details of when it was last updated, but the date of creation for each individual entry is available. As Figure 5 shows, although the database grew steadily between 1997 and 2004, there are very few entries since 2005 and none since 2008.
BIC users are able to download the complete BRCA1 and BRCA2 variant lists as tab-delimited spreadsheets. However, the variants are not named according to the HGVS guidelines. Instead, information on nucleotide position (numbered from the start of the reference sequence and not the start codon) and type of variation are in different columns in the spreadsheet. In addition, variations described as insertions in BIC appear to have been wrongly named when tested with Mutalyzer, which identifies these variations as duplications.
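The distinction between an insertion and a duplication mentioned above can be illustrated with a simplified check. Under the HGVS recommendations, an inserted sequence that is a copy of the sequence directly next to the insertion point is described as a duplication rather than an insertion. The sketch below is not Mutalyzer's algorithm, and the function name and arguments are hypothetical. It only tests whether the inserted bases repeat the reference bases immediately before or after the insertion point.

```python
def is_duplication(reference: str, position: int, inserted: str) -> bool:
    """Return True if the inserted bases copy the adjacent reference bases.

    reference: reference sequence as a string of bases.
    position:  number of reference bases that precede the insertion point.
    inserted:  the inserted bases.
    """
    length = len(inserted)
    if length == 0:
        return False
    ref = reference.upper()
    ins = inserted.upper()
    before = ref[max(0, position - length):position]  # bases just 5' of the gap
    after = ref[position:position + length]           # bases just 3' of the gap
    return ins == before or ins == after
```

With the toy reference `ACGTTGCA`, inserting `TT` after the fifth base repeats the preceding `TT`, so the sketch reports a duplication, whereas inserting `GG` at the same point does not.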
Figure 5: Growth in the size of the BIC BRCA1 and BRCA2 databases. The database entries refer to the total number of variant entries, not the number of unique variants.
3.2.4 Diagnostic Mutation Database (DMuDB). (http://www.ngrl.org.uk/Manchester/projects/informatics/dmudb). DMuDB was initiated by Graham Taylor at the University of Leeds and has been developed by the National Genetics Reference Laboratory Manchester to store the results of diagnostic tests, including BRCA1 and BRCA2, for UK laboratories. The database differs from the LSDBs discussed above because it is patient-referral centred rather than locus centred. The database currently holds 12,076 referrals, which contain more than 37,000 individual variants in 50 genes. There are reports for approximately 30,000 BRCA variants. Access is currently restricted to staff at UK laboratories.
The database provides a mechanism for clinical scientists to store and share data that would otherwise remain unpublished. Users submit their data to the database with the understanding that they retain ownership of the data and can control the extent to which it is shared with other users. In addition, the pathogenicity of variants is decided by the submitter, rather than by a committee as in BIC. However, because UK labs work to similar standard operating procedures (SOPs), there appears to be a high degree of consistency between reports of the same variant from different UK laboratories.
3.2.5 Human Gene Mutation Database (HGMD). (http://www.hgmd.cf.ac.uk/ac/index.php). HGMD has been developed by the Institute of Medical Genetics in Cardiff and Biobase. The database is password protected but access is not restricted to the UK. There is a free version of the database and a more up-to-date version available to paying subscribers. It covers many genes (3,960 in the latest Professional release), including BRCA1 and BRCA2. HGMD is a database of published variants. It provides a link to the first published article on a variant, plus subsequent publications if they enhance the original report (Professional version). In some cases the first publication could be as long ago as 1994.
HGMD does not use the HGVS nomenclature to describe variants. For missense changes it lists the codon position, the old and new codons and the amino acid change (see Figure 6a). For small (≤ 20 bp) deletions, the deleted bases are shown in lower case, with the flanking 10 bp in upper case. The numbered codon (if the deletion is in an exon) is indicated with a caret (^) (see Figure 6b). In addition, HGMD does not associate a pathogenicity score with the variants but describes the phenotype associated with the variant.
Figure 6: Examples of variants in the HGMD database showing alternative nomenclature for variants.
Some HGMD data have been incorporated into Ensembl. Figure 7 shows the Ensembl view of HGMD entry CM034005 (codon 1458 changing from CAG to TAG), which creates a stop codon and is listed as causing breast cancer in HGMD. Ensembl shows the position of the variant but does not give any other details about the type of change or the phenotype.
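A codon-based description of the kind used by HGMD for missense and nonsense changes can, in principle, be translated into an HGVS-style coding DNA name. The sketch below assumes that codon 1 is the ATG start codon, so that the first base of codon n is at coding position 3(n - 1) + 1. It also assumes that exactly one base differs between the old and new codons. The function name is hypothetical, and the output omits the reference sequence accession that a full HGVS name requires.

```python
def codon_change_to_cdna(codon_number: int, old_codon: str, new_codon: str) -> str:
    """Convert a codon-based substitution to an HGVS-style c. description.

    Assumptions: codon 1 is the start codon; exactly one base is changed.
    """
    old, new = old_codon.upper(), new_codon.upper()
    if len(old) != 3 or len(new) != 3:
        raise ValueError("codons must be three bases long")
    changed = [i for i in range(3) if old[i] != new[i]]
    if len(changed) != 1:
        raise ValueError("expected exactly one changed base")
    offset = changed[0]
    # First base of codon n is at coding position 3 * (n - 1) + 1.
    position = 3 * (codon_number - 1) + offset + 1
    return f"c.{position}{old[offset]}>{new[offset]}"
```

Under these assumptions, the CM034005 example above (codon 1458, CAG to TAG) would be written `c.4372C>T`.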
Figure 7: Integration of HGMD data into Ensembl.
HGMD entries are also stored in the HGVbaseG2P database (see Section 8.3). The variants cannot be searched for but are visible in the HGVbaseG2P genome browser (see Figure 8).
Figure 8: Integration of HGMD data into HGVbaseG2P.
3.2.6 Single Nucleotide Polymorphism Database (dbSNP). (http://www.ncbi.nlm.nih.gov/projects/SNP/). dbSNP is a publicly accessible repository of genetic variation. As discussed in section 2.10, it is used by clinical scientists to determine whether variants are benign polymorphisms. Variants from LSDBs have been incorporated into dbSNP. These are described as having “clinical association”, and details of the source of the data can be viewed by selecting the VariationView option. Figure 9 shows the submissions for rs55968715 from the Ostrander Lab at the NIH and Lawrence Brody at the NHGRI (this is the BIC database). The table indicates that there is a third submission for this SNP, which is not currently listed. This came from Amanda Spurdle at the Queensland Institute of Medical Research.
Figure 9: Use of dbSNP Variation Viewer to display BRCA2 clinically associated variants.
3.3 Bioinformatics tools
Clinical scientists use two types of bioinformatics tools to analyse variant pathogenicity. These are missense analysis tools (typically Align-GVGD, SIFT and PolyPhen), which estimate the effect of an amino acid change caused by a missense mutation, and splicing tools (including Fruitfly, NetGene2 and Human Splicing Finder), which estimate the effect of a mutation on the splicing of an mRNA sequence. As discussed in section 2.6, it is not considered acceptable for a variant to be described as pathogenic or non-pathogenic solely on the basis of bioinformatics analysis. These tools may not have been developed for clinical use, and the accuracy of the results obtained has not been adequately assessed. Because of the uncertainty regarding the accuracy of these tools, clinical scientists will analyse the same variant using several tools and produce an “aggregate result”. To automate this repetitive and time-consuming task, Alamut analyses a variant with multiple tools at once and integrates the results into a single view.
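The report does not specify how an “aggregate result” is formed, so the following Python sketch is only one possible reading: predictions are tallied and flagged as conflicting when the tools disagree. The function name and the labels "damaging" and "tolerated" are assumptions. They stand for normalised categories rather than the native output of any tool, and the sketch is not a description of how Alamut works.

```python
from collections import Counter
from typing import Dict


def aggregate_predictions(predictions: Dict[str, str]) -> str:
    """Summarise per-tool predictions as one label.

    predictions maps a tool name to "damaging" or "tolerated"
    (assumed normalised labels; any other value is ignored).
    """
    counts = Counter(predictions.values())
    damaging = counts.get("damaging", 0)
    tolerated = counts.get("tolerated", 0)
    if damaging and tolerated:
        return "conflicting"
    if damaging:
        return "damaging"
    if tolerated:
        return "tolerated"
    return "no prediction"


# Missense predictions reported for BRCA1 c.1067A>G in section 4.2.1.
q356r = {
    "SIFT": "tolerated",
    "PolyPhen": "damaging",
    "PolyPhen-2": "damaging",
    "Align-GVGD": "tolerated",
}
summary = aggregate_predictions(q356r)
```

For the c.1067A>G predictions the summary is "conflicting", which is consistent with the requirement that bioinformatics evidence is never used on its own.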
Issues
• The Unclassified Variants guidelines state that it is essential that laboratories have mechanisms in place to submit results to LSDBs and in the past they have been criticised for failing to do so. However, in the case of BRCA1 and BRCA2 variants it is not clear which LSDB they could use. The LOVD databases are intended for published data, or for previously unclassified variants, or for a different phenotype, or are not currently in use or accessible. The BIC database does contain clinical data but currently isn’t being updated, while the UMD databases store data for a particular country and are restricted access. This is also true of DMuDB although this policy may be about to change.
• DMuDB does not follow the usual model for an LSDB in that it does not have curators who are experts on particular genes. Instead the submitters are experts in the sequencing and analysis of variants. Their analyses will be performed to an SOP and will be verified by other experts before submission. Therefore DMuDB has moved from an expert curator to an expert submitter model. The curators’ expertise is in building and maintaining the database. This model does not appear to have restricted the growth of DMuDB; indeed it has the most BRCA variants of the databases considered here.
• The reporting of pathogenicity varies between LSDBs. The IARC values are only used in one database (the BRCA1 classification databases). It may limit the usefulness of integrating different LSDBs if the clinical interpretation of results cannot also be integrated.
• The use of the LOVD variant classification system allows curator and submitter to provide their classification of pathogenicity. However, in the case of the BRCA1 and BRCA2 publication databases (and other LOVD databases) only one classification is given. This could cause problems when integrating, since the classification “+/?” could mean “pathogenic/uncertain” or “pathogenic/no opinion given”.
• The lack of correct use of HGVS variant names in some LSDBs, such as BIC and HGMD, will complicate the process of data integration.
• The integration of clinical variants into dbSNP does not include giving any details of the clinical significance of the variants. Are these pathogenic variants or benign polymorphisms that have been included in an LSDB?
• It is not clear whether the submissions of clinical variants to dbSNP are independent. For example, the Ostrander Lab has submitted data to BIC in the past. It might be the case that the same identification of a variant is submitted more than once: from the lab that generated the variants and from the LSDB. This could be problematic if these data are used to estimate the frequency of occurrence of these variants.
• Although clinical scientists are using multiple tools to evaluate pathogenicity, there are many more tools which do not appear to be used. Lists of tools are available at the GEN2PHEN Knowledge Centre (http://www.gen2phen.org/content/functionalprediction) and the PON-P (Pathogenic or not pipeline) website (http://bioinf.uta.fi/PON-P/). It may be that these tools generate more reliable results than those currently in use. NGRL Manchester is currently conducting an assessment of clinical bioinformatics software and may recommend that other tools are adopted.
4 Examples of variant analysis by clinical scientists.
The following examples involve real variants which have been analysed by diagnostic laboratories. These examples are used in training courses organised by NGRL Manchester in order to demonstrate how bioinformatics resources should be used and information integrated.
4.1 BRCA1 U14680.1:c.211A>G. This variant causes an Arg to Gly substitution at position 71. The lines of evidence for this variant are:
4.1.1 In silico predictions:
• Multiple sequence alignment of BRCA1 orthologs shows that Arg71 is a highly conserved residue.
• The change of Arg to Gly has a moderate Grantham distance (125).
• SIFT predicts that the change from Arg to Gly will affect protein function.
• The 211A>G change is in the -2 position of a 5’ exon splice site. The change affects a predicted exonic splice enhancer (ESE) hexamer. However, the change is not in the +1 or +2 position and therefore could not be predicted to destroy a splice site. In addition, ESEs are frequently predicted to occur within exonic sequences: RESCUE-ESE (http://genes.mit.edu/burgelab/rescue-ese/) predicts 332 ESEs within U14680.1.
4.1.2 Database searches
• The R71G change has been reported 36 times in the BIC database. The BIC steering group has classified the change as clinically significant.
• The BRCA1 LOVD database contains 5 entries for publications describing R71G (see Figure 10). Two list the variant as pathogenic (+/?) and three list it as non-pathogenic (-/?).
Figure 10: BRCA1 publication database showing c.211A>G submissions.
• This variant is not reported in dbSNP. (Since this section was written, the variant has been included in build 132 of dbSNP; it was submitted via BIC.)
• This variant is reported in HGMD. The paper cited (Diez et al 1999) describes several variants identified in 83 Spanish breast cancer/ovarian cancer families and lists this variant among the “missense mutations of unknown significance”.
4.1.3 Publications
Searching PubMed using “BRCA1 c.211A>G” identifies one paper (Santos et al., 2009), while searching using “BRCA1 R71G” identifies the Santos et al (2009) and Diez et al (1999) papers plus a third paper by Vega et al (2001). Neither search identified the Pettigrew et al (2005), Ruffner et al (2001) or Morris et al (2006) papers listed in LOVD.
Searching using Google Scholar identified rather more papers: 64 for BRCA1 R71G and seven for BRCA1 c.211A>G. No hits were obtained using the full HGVS nomenclature for this variant.
The papers identified in PubMed and LOVD give conflicting reports on the pathogenicity of this variant. Ruffner et al (2001) use an enzyme assay that measures E3 ubiquitin ligase activity. Their results show that although the R71G change occurs within the BRCA1 RING domain, it does not appear to affect enzyme function. The Morris et al paper also describes an enzyme assay and considers the position of the changed residue within the 3D structure of the protein. Their results also indicate that R71G does not affect enzyme function.
The paper by Pettigrew et al (2005) is a comparative analysis of predicted ESE sites in BRCA1 orthologs. It is not certain why this paper is reported in the LOVD as describing the 211A>G variant as pathogenic. The paper does not refer to the 211A>G variant, although it does contain data on 330A>G, which may be the same variant (there is a 119 bp 5’ UTR; adding this to 211 would place the variant at position 330). However, the paper lists this variant as increasing the ESE (exonic splicing enhancer) motif score.
The crucial piece of evidence in determining the pathogenicity of this variant is provided by Vega et al (2001). They demonstrate that 211A>G is responsible for aberrant splicing of the BRCA1 transcript. The authors demonstrate this using RT-PCR on mRNA derived from peripheral blood cells. Two PCR products were identified, corresponding to the unaffected and 211A>G alleles. Sequencing of the 211A>G PCR product showed that 22 bp of exon 5 were deleted, creating a new stop codon within exon 6. It is interesting to note that again the authors did not refer to this variant as 211A>G but as 330A>G. As well as identifying the effect on splicing, the authors also observed the co-segregation of the mutation with the disease in a large family. The paper by Santos et al (2009) also shows the effect of 211A>G on splicing.
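The relationship between the two numbering schemes described above is simple arithmetic: a position counted from the start of the reference sequence is converted to a coding position by subtracting the length of the 5’ UTR. The sketch below uses the 119 bp offset given above for U14680.1. The function name is hypothetical, and only simple substitutions of the form `330A>G` are handled.

```python
import re

UTR_LENGTH_U14680 = 119  # length of the 5' UTR of U14680.1, as stated above


def legacy_to_cdna(name: str, utr_length: int = UTR_LENGTH_U14680) -> str:
    """Convert a substitution numbered from the start of the reference
    sequence (e.g. '330A>G') to coding DNA numbering (e.g. 'c.211A>G')."""
    match = re.fullmatch(r"(\d+)([ACGT]>[ACGT])", name.strip())
    if match is None:
        raise ValueError(f"not a simple substitution: {name!r}")
    position = int(match.group(1)) - utr_length
    if position < 1:
        raise ValueError("position lies within the 5' UTR")
    return f"c.{position}{match.group(2)}"
```

Here `legacy_to_cdna("330A>G")` returns `c.211A>G`. Normalising names in this way before searching would allow reports that use the older numbering to be matched to the same variant.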
4.1.4 Classification. Because of the effect on splicing observed by Vega et al (2001) and Santos et al (2009), this variant has been classified as Class 5, certainly pathogenic.
4.2 BRCA1 U14680.1:c.1067A>G. This variant causes a Gln to Arg substitution at position 356. The lines of evidence for this variant are:
4.2.1 In silico predictions
• Multiple sequence alignment of BRCA1 orthologs shows that Gln356 is a moderately conserved amino acid.
• The change from Gln to Arg has a small Grantham distance (43).
• SIFT predicts that the change is tolerated.
• PolyPhen predicts that the change is probably damaging.
• PolyPhen-2 predicts that the change is probably damaging (score = 0.977).
• Align-GVGD predicts that the change is tolerated.
• Splice site analysis tools predict a minor effect on splicing. The variant could introduce new ESE hexamers.
4.2.2 Databases
• The variant is reported in BIC 82 times and described as “clinical importance unknown”.
• The variant is reported in dbSNP (rs1799950), allele frequency 0.031 (N = 495).
• HGMD lists a publication (Schoumacher et al, 2001) which gives no firm evidence for or against pathogenicity.
• The BRCA1 LOVD database contains thirteen entries for publications describing Q356R. Four describe the variant as pathogenic (+/?), four as non-pathogenic (-/?) and five as uncertain (?/?) (see Figure 11).
Figure 11: BRCA1 publication database showing c.1067A>G submissions.
4.2.3 Publications
Evidence from papers listed in BRCA1 LOVD and a further six papers identified in PubMed (marked *) is listed in Table 1.

| Publication | Clinical effect | Evidence |
|---|---|---|
| Burk-Herrick et al. (2006) | ? | Analysed 154 mutations from BIC using an alignment based on 132 mammalian sequences and compared the results with those obtained using SIFT and those obtained by Fleming et al. (2003). |
| Cox et al. (2005) | | Used whole-gene resequencing data to examine the association between BRCA1 SNPs and breast cancer using 1323 cases and 1910 controls. Observed homozygotes of the 356Arg allele in both cases and controls. |
| Diez et al. (2003) | ? | Identified the 1186A>G variant in 7% of Spanish breast/ovarian cancer patients. No comment regarding pathogenicity. |
| Greenman et al. (1998) | | Describe Q356R as a polymorphism because it does not meet their criteria for pathogenicity. These are: segregation with the disease; absence in ethnically matched controls; nonconservative amino acid substitutions; residue conserved in the murine and canine homologues of BRCA1; occurrence within a conserved and possibly functional motif. |
| Johnson et al. (2007) | + | Analysed 1037 non-synonymous SNPs in candidate cancer genes in 2463 controls and 473 breast cancer cases with two primary breast cancers. Of all the SNPs assessed in this study, Q356R had the highest odds ratio (1.72, minor allele frequency = 5.4%). |
| Lee et al. (2008) | + | Used an alignment-based method. Alignment for BRCA1 residues 225 to 1365 from 55 organisms. Their method indicates that this variant is deleterious, in agreement with PolyPhen. |
| McKean-Cowdin et al. (2005) | ? | Identified BRCA1 variants in African-American and Latina women and compared variant frequencies in both populations. No evidence presented relating to Q356R pathogenicity. |
| Miki et al. (1994) | ? | Early paper identifying candidate breast cancer gene. No evidence specific to this variant, possibly due to renumbering. |
| Tavtigian et al. (2006) (listed as 2005 in LOVD) | | Described Q356R as neutral based upon co-occurrence with clearly deleterious mutations in BRCA1, and prediction by Align-GVGD. |
| Tomassi et al. (2008) | + | Assessed pathogenicity using multiple bioinformatic tools. All software used demonstrated a possible biological implication of Q356R BRCA1. |
| Auranen et al. (2005) * | - | Investigated whether polymorphisms in DNA double strand break repair genes are associated with epithelial ovarian cancer (EOC) risk. Study involved 1,600 cases and 4,241 controls from 4 separate genetic association studies from 3 countries. No association detected between EOC risk and Q356R. |
| Hadjisavvas et al. (2004) * | + | A pair of rare variants, Q356R and S1512I, was detected in BRCA1 in patients belonging to two Cypriot families. The simultaneous presence of this pair of missense mutations may be associated with the breast cancer phenotype in the Cypriot population. |
| Janezic et al. (1999) * | + | Study aimed to provide more accurate frequency estimates of breast cancer susceptibility gene 1 (BRCA1) germline alterations in the ovarian cancer population. The rare form of the Q356R polymorphism was significantly (P = 0.03) associated with a family history of ovarian cancer, suggesting that this polymorphism may influence ovarian cancer risk. |
| Menzel et al. (2004) * | - | Two case control studies in Tyrol and Prague. Did not identify any association between Q356R and breast cancer. |
| Seymour et al. (2008) * | | Complete sequencing of coding regions from 217 women from high-risk breast cancer families and 155 age-matched controls. Q356R did not show a significant risk association. |
| Wenham et al. (2003) * | | Case control study of ovarian cancer performed in North Carolina. Study involved 312 women with ovarian cancer and 401 age- and race-matched controls. Q356R not associated with ovarian cancer risk. |

Table 1: Overview of publications describing the pathogenicity of BRCA1 c.1067A>G variant.
4.2.4 Conclusion. This variant has been classified as Class 2: unlikely to be pathogenic, although this cannot be formally proven. The variant is thought probably to be a benign polymorphism on the basis of several large association studies. However, it is not possible to completely exclude minor effects.
4.3 Issues. As these two examples demonstrate, the job of the clinical scientist in integrating multiple lines of evidence can be challenging. There are several issues:
• There is no formal method for integrating variants and no definition of what is meant by pathogenic.
• Bioinformatics tools can produce contradictory answers.
• LSDBs may not provide a clear view of the likely pathogenicity of a variant.
• It is almost impossible for an LSDB curator to ensure that their database is up to date and that no key papers are missed, especially for genes such as BRCA1 and BRCA2 where there are large numbers of laboratories generating data. This task is made even more difficult by authors using alternative naming strategies to describe their variants.
• Clinical scientists may need to evaluate the evidence from many publications.
• Some published articles use much less rigorous standards in evaluating pathogenicity than clinical scientists. Some of the evidence used to assign pathogenicity in the publications discussed above would not be considered sufficient by diagnostic laboratories.
• These discrepancies are reproduced in LSDBs. For example, describing a variant as pathogenic on the basis of a Grantham score would not be acceptable for a diagnostic laboratory.
• There may be one critical piece of evidence which decides how a variant is classified. For example, in the analysis of c.211A>G the splicing analysis by Vega et al (2001) is key to determining that the variant is pathogenic. If this piece of evidence were not identified by the clinical scientist analysing this variant, they might decide that the variant was less likely to be pathogenic, based on the enzyme assay developed by Ruffner et al (2001).
5 The ENIGMA consortium
The ENIGMA consortium has been established in order to classify BRCA1 and BRCA2 variants which are currently unclassified. The consortium will function by pooling information, for example on segregation within BRCA families or immunohistochemistry
results; and by selecting sets of variants for analysis, for example mRNA splicing analysis or enzyme assays. This is a similar approach to that used by clinical scientists. Multiple lines of evidence are combined in order to produce a classification, and the lines of evidence are largely the same as those that would be considered by a clinical scientist. However, the variants being considered are ones that clinical scientists have been unable to classify, often because there was not sufficient data available for them to develop a classification.
In order to store the multiple lines of evidence, ENIGMA is developing a relational database. This database will not be made accessible outside the consortium. Instead, ENIGMA will make classifications available both via publications and by submission to the BIC database. The process of requirements gathering for the ENIGMA database is still ongoing. However, from responses received to date it is clear that the consortium will require far more data to be stored than is usually the case for LSDBs.
The following are a selection of the proposed data fields for in vivo analysis of mRNA splicing:
• Forward Primer
• Reverse Primer
• RNA source - Cells / tissue (and collection method)
• Culture conditions
• Nonsense mediated decay inhibition
• RNA extraction method
• DNase I treatment
• RNA storage
• Amount of RNA used in cDNA synthesis
• cDNA synthesis primer
• cDNA synthesis protocol
• PCR Amplification
• PCR product analysis
• Aberration(s) Detected Experimentally - Qualitative data
• Aberration(s) Detected Experimentally - Quantitative Data - including allele-specific detection assays
• Aberration(s) at RNA level - HGVS nomenclature
• Aberration(s) at protein level - HGVS nomenclature
• Genomic Co-ordinates of Aberration(s)
• Level of full-length transcript produced by variant allele (%)
• Overall Level of full-length transcript (%)
• Methodological deficiencies
• Comment
• Qualitative 5-class IARC Splicing Classification (Spurdle et al., 2008)
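To give a sense of how a subset of the fields listed above might be held in software, the following Python sketch defines a simple record type. It is not the ENIGMA schema: the class name, attribute names, types and validation rules are all assumptions made for illustration, and only some of the listed fields are included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SplicingAssayRecord:
    """Hypothetical record for one mRNA splicing assay (illustrative only)."""
    forward_primer: str
    reverse_primer: str
    rna_source: str                      # cells / tissue and collection method
    culture_conditions: str = ""
    nmd_inhibition: Optional[bool] = None
    rna_extraction_method: str = ""
    rna_aberration_hgvs: str = ""        # aberration(s) at RNA level
    protein_aberration_hgvs: str = ""    # aberration(s) at protein level
    full_length_variant_allele_pct: Optional[float] = None
    full_length_overall_pct: Optional[float] = None
    methodological_deficiencies: str = ""
    comment: str = ""
    iarc_splicing_class: Optional[int] = None  # qualitative 5-class scheme

    def __post_init__(self) -> None:
        # Percentages, when given, must lie between 0 and 100.
        for name in ("full_length_variant_allele_pct", "full_length_overall_pct"):
            value = getattr(self, name)
            if value is not None and not 0.0 <= value <= 100.0:
                raise ValueError(f"{name} must be between 0 and 100")
        if self.iarc_splicing_class is not None and not 1 <= self.iarc_splicing_class <= 5:
            raise ValueError("iarc_splicing_class must be between 1 and 5")
```

Validating percentages and class values when a record is created is one way to keep such a detailed data set consistent, although the actual database design may differ.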
6 Role of next generation sequencing in clinical sequencing.
At present the genetic screening of breast cancer patients focuses on the sequencing of BRCA1 and BRCA2 genes. However, these are not the only genes associated with breast and ovarian cancer. Genes associated with other diseases such as BRIP1 and PALB2 (Fanconi
anemia), TP53 (Li-Fraumeni syndrome), PTEN (Cowden syndrome), STK11 (Peutz-Jeghers syndrome) and CDH1 (hereditary diffuse gastric cancer) have been associated with breast cancer (Seal et al., 2006; Rahman et al., 2007; Gonzalez et al., 2009; FitzGerald et al., 1998; Hearle et al., 2006; and Schrader et al., 2008), while genes responsible for HNPCC have been associated with ovarian cancer (Aarnio et al., 1999). Therefore sequencing only BRCA1 and BRCA2 may miss the underlying mutations responsible for breast or ovarian cancer. In addition, genetic testing for BRCA1 and BRCA2 mutations tends to be used for women with a family history of breast or ovarian cancer. However, some breast cancer patients have no family history of breast cancer despite carrying an underlying inherited mutation, because their mutation was paternally inherited, their family is small, or no other female family members inherited the mutation.
The recent advances in sequencing technologies could help to solve both these problems. Using next generation sequencing, it will be possible to sequence many genes simultaneously, and the speed and, in time, the reduced costs associated with these technologies may mean that more patients can be tested. An example of how next generation sequencing may be used is provided in a recent publication by Walsh et al. (2010). In this study 21 genes responsible for inherited risk of cancer were fully sequenced in 20 patients, and small variants as well as large deletions and duplications were identified and analysed. Clearly, if more genes are being sequenced in a larger number of patients, the amount of bioinformatics analysis required will increase.
Part of the analysis method used by Walsh et al. (2010) is shown in Figure 12. Locus-specific databases and dbSNP were searched to establish whether a variant was known to be either pathogenic or a benign polymorphism. Candidate variants were then checked to determine whether they were exonic, intronic or intergenic, and to determine the effect of the variation (nonsynonymous substitution, frameshift, splice site mutation etc.).
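The two steps described above (a lookup against known variants, followed by annotation of location and effect) could be sketched as follows. This is a minimal illustration only and is not the method of Walsh et al.: the two lookup sets stand in for queries against locus-specific databases and dbSNP, and the variant names, the `Variant` class and the `triage` function are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical stand-ins for LSDB and dbSNP lookups (not real variants).
KNOWN_PATHOGENIC = {"GENE_A:c.100del"}
KNOWN_BENIGN = {"GENE_A:c.200A>G"}


@dataclass
class Variant:
    name: str      # HGVS-style description
    location: str  # "exonic", "intronic" or "intergenic"
    effect: str    # e.g. "nonsynonymous", "frameshift", "splice site"


def triage(variant: Variant) -> str:
    """Report the database status first; otherwise describe the candidate."""
    if variant.name in KNOWN_PATHOGENIC:
        return "known pathogenic"
    if variant.name in KNOWN_BENIGN:
        return "known benign polymorphism"
    return f"candidate: {variant.location}, {variant.effect}"


variants = [
    Variant("GENE_A:c.100del", "exonic", "frameshift"),
    Variant("GENE_A:c.200A>G", "exonic", "nonsynonymous"),
    Variant("GENE_B:c.50+2T>C", "intronic", "splice site"),
]
for v in variants:
    print(v.name, "->", triage(v))
```

Only variants that are not already classified reach the annotation step, which is what keeps the manual workload manageable as the number of genes and patients grows.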
Figure 12: Methodology for analysis of variants used by Walsh et al. (part of figure from Walsh et al. (2010))
As well as being used to identify genetic causes of breast cancer, next generation sequencing techniques have been used to identify the genes underlying rare Mendelian disorders. For example, Ng et al. (2010) identified DHODH as a candidate gene for Miller syndrome and Bilgüvar et al. (2010) identified WDR62 mutations in severe brain malformations. A feature of these studies is that they require the sequencing of only a few individuals. Ng et al. (2010) sequenced four individuals, including two siblings, while Bilgüvar et al. initially sequenced two individuals from a small consanguineous family and then identified WDR62 mutations in other individuals.
Whole exome sequencing will generate large numbers of variants. To avoid having to analyse thousands of variants, researchers have developed analysis methods which focus on excluding as many variants from further analysis as possible. For example, Ng et al. began by focussing only on non-synonymous variants, splice acceptor and donor site mutations and short insertions or deletions. The same variant had to be present in both siblings and a variant had to be present in the same gene in the other two kindreds. Common variants (present in dbSNP129 or HapMap8) were excluded. This reduced the number of candidate genes to nine, assuming a recessive model of inheritance for Miller syndrome.
This type of methodology is suitable for a disorder such as Miller syndrome. Variants causing a rare disease such as Miller syndrome are unlikely to be present in dbSNP and, because the disease has a well defined phenotype, variants can be eliminated by comparing individuals.
As well as identifying candidates for rare diseases, whole exome sequencing is being used to identify candidates for more frequently observed phenotypes.
Vissers et al. (2010) investigated de novo mutations which might be responsible for mental retardation by sequencing eight parent-child trios. Because the authors were searching for de novo mutations, they were able to exclude variants identified in the children that were inherited from the unaffected parents. In addition, variants present in dbSNP were excluded, as were nongenic, intronic and synonymous variants.
The extent to which NGS will be used in clinical genetics is still not clear. However, next generation sequencers are already being used for clinical sequencing. For example, sequencing of clinically important genes, including BRCA1 and BRCA2, is now being offered by NewGene (http://www.newgene.org.uk) using a Roche 454 sequencer. In this case an existing test is being offered using techniques which allow higher throughput and faster turnaround times. NGS will also allow an increase in the range of genetic tests available. For example, the Manchester Biomedical Research Centre is developing improved eye gene tests. To date, roughly 140 genes have been associated with eye disease. This is clearly too many to be sequenced in a conventional test; instead, a new method using targeted exon enrichment and sequencing on a SOLiD sequencer is being developed.
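The exome filtering strategies described above for the Miller syndrome and parent-child trio studies can be illustrated with a short sketch. It is a simplification under stated assumptions, not the published pipelines: the variant records, identifiers, gene names and effect labels are hypothetical, and the sketch does not model the recessive inheritance requirement or the condition that siblings share the same variant.

```python
# Effects retained for further analysis (labels are illustrative).
KEPT_EFFECTS = {"nonsynonymous", "splice_site", "indel"}


def keep_rare_functional(variants, common_ids):
    """Drop variants with other effects and those seen in common-variant sets."""
    return [
        v for v in variants
        if v["effect"] in KEPT_EFFECTS and v["id"] not in common_ids
    ]


def genes_shared_by_all(variants_per_individual):
    """Genes carrying at least one remaining variant in every individual."""
    gene_sets = [{v["gene"] for v in vs} for vs in variants_per_individual]
    return set.intersection(*gene_sets) if gene_sets else set()


def de_novo(child, mother, father):
    """Variants in the child that are absent from both parents."""
    parental = {v["id"] for v in mother} | {v["id"] for v in father}
    return [v for v in child if v["id"] not in parental]


# Hypothetical toy data.
common_ids = {"v2"}  # stands in for a dbSNP/HapMap lookup
patient_1 = [
    {"id": "v1", "gene": "G1", "effect": "nonsynonymous"},
    {"id": "v2", "gene": "G2", "effect": "nonsynonymous"},
    {"id": "v3", "gene": "G3", "effect": "synonymous"},
]
patient_2 = [
    {"id": "v4", "gene": "G1", "effect": "indel"},
    {"id": "v5", "gene": "G4", "effect": "splice_site"},
]
filtered = [keep_rare_functional(p, common_ids) for p in (patient_1, patient_2)]
candidate_genes = genes_shared_by_all(filtered)

child = [{"id": "v1", "gene": "G1", "effect": "nonsynonymous"},
         {"id": "v6", "gene": "G5", "effect": "nonsynonymous"}]
mother = [{"id": "v1", "gene": "G1", "effect": "nonsynonymous"}]
father = []
new_in_child = de_novo(child, mother, father)
print(candidate_genes, [v["id"] for v in new_in_child])
```

Each filter is cheap and independent of the others, so the filters can be chained in whichever order removes the most variants first.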
As “third generation” sequencers become available there may be yet more changes to clinical sequencing. These methods, such as SMRT (single molecule, real time) sequencing developed by Pacific Biosciences (http://www.pacificbiosciences.com), and label-free single molecule sequencing developed by Oxford Nanopore Technologies (http://www.nanoporetech.com/), do not require PCR amplification of template DNA and should be faster than second generation technologies such as Illumina and SOLiD sequencers. These techniques could also be much less expensive than conventional sequencing; the “$1000 genome” has been suggested, but even cheaper genomes may be possible. GnuBio have suggested that a whole genome might be sequenced for $30 (http://fluidicmems.com/2010/06/03/gnubio-will-droplet-basedsequencing-from-the-weitz-lab-win-the-race). The development of these techniques may mean that clinical whole genome sequencing becomes a technical and financial possibility, even if the means to interpret the data are not in place.
As well as being cheaper and faster, third generation sequencing offers the prospect of new types of sequence data being generated. For example, the Pacific Biosciences SMRT sequencer has been used to directly detect DNA methylation without the need for bisulfite conversion (Flusberg et al., 2010). Other third generation methodologies such as nanopore sequencing will offer similar possibilities. This raises the possibility of new types of clinical sequencing experiments, such as identification of epigenetic modifications as a part of cancer diagnosis (Costa, 2010).
6.1 Issues
A diagnostic test involving the sequencing of many genes, such as the eye disease genes test described above, raises questions about how the variant data should be managed. One model might be that the variants identified in 140 genes would be sent to 140 different LOVDs. However, it is unlikely that those generating the data would be happy with having to repeat the same task 140 times. Instead, it might be more reasonable to expect them to add their data to a single database, and this database might then be used to distribute their data across LSDBs. For example, data might be stored in a repository such as DMuDB, which then uses Café RouGE (http://www.caferouge.org) to inform curators of the variants.
In order to deal with the large numbers of variants identified by NGS experiments, bioinformaticians will need to develop informatics pipelines to automate as much of the analysis as possible. A pipeline that analyses variants from 140 genes will ideally not have to include 140 different LSDBs, since the likelihood of part of the pipeline failing will increase with each web service added. In addition, there would be problems caused by the lack of standards for describing pathogenicity across different LSDBs.
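The point about pipeline reliability can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that every web service fails independently with the same probability on any given run; the 1% failure rate is a hypothetical figure and not a measurement of any LSDB.

```python
def pipeline_failure_probability(per_service_failure: float, n_services: int) -> float:
    """Probability that at least one of n independent services fails."""
    return 1.0 - (1.0 - per_service_failure) ** n_services


# Hypothetical 1% chance that any single service is unavailable.
for n in (1, 10, 140):
    print(n, round(pipeline_failure_probability(0.01, n), 3))
```

Under these assumptions a pipeline that depends on 140 separate services would fail on roughly three runs in four, whereas a pipeline that queries a single central repository would fail on about one run in a hundred.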
7 Genome-wide association studies.
GWA studies are a method of performing an association study without having prior knowledge of which genes are likely to be involved (Need and Goldstein, 2010). They have been used in recent years to explore the relationship between common genetic variation and disease, biological characteristics and drug responses. Underpinning GWAS is the theory that
common diseases are caused by common variants (Chakravarti, 1999; Reich and Lander, 2001). However, while GWA studies have generated hundreds of confirmed susceptibility factors, the variants that are identified are only responsible for a small fraction of the genetic variation (Goldstein, 2009). Because of this, these results are considered of little diagnostic utility by clinicians and are not incorporated in clinical tests. The reason for this may be that GWAS experiments analyse shared ancestral SNPs which have been maintained within the population for many generations and occur frequently (minor allele frequency > ~5%) in the population. It is argued that any variant that has survived within a population for a large number of generations cannot be highly pathogenic. There is some evidence for this argument, since most of the factors identified in GWA studies have odds ratios well below 1.5 (Read and Donnai, 2010).
This has led to the development of an alternative to the “common disease, common variant” theory, which is that rare variants can play an important role in the development of common diseases (for example, see Cirulli and Goldstein, 2010). Another hypothesis is that while the effect of individual SNPs is small, the combined effect of having multiple SNPs may be much larger (for example, see comments by Kári Stefánsson in Guttmacher et al., 2010). If this is correct and combinations of common variants can be shown to have a significant effect on genetic diseases, then it may be that future genetic testing will incorporate GWAS-based evidence. If so, there may be a need to develop more precise definitions of pathogenicity to distinguish low penetrance GWAS alleles from the more highly penetrant ones associated with current genetic testing.
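As a worked illustration of the quantities discussed above, the sketch below computes an odds ratio from a 2x2 table of cases and controls, and then shows how several small effects could add up. The counts are invented, and the multiplicative combination of independent risk alleles is an assumption made only for this example; it is not a model proposed in the studies cited.

```python
def odds_ratio(cases_with, cases_without, controls_with, controls_without):
    """Odds of carrying the allele in cases divided by the odds in controls."""
    return (cases_with * controls_without) / (cases_without * controls_with)


def combined_odds_ratio(odds_ratios):
    """Combined effect, ASSUMING independent alleles that act multiplicatively."""
    result = 1.0
    for value in odds_ratios:
        result *= value
    return result


# Invented counts: 60/40 carriers/non-carriers in cases, 50/50 in controls.
single = odds_ratio(60, 40, 50, 50)
# Ten hypothetical risk alleles, each with a modest odds ratio of 1.2.
combined = combined_odds_ratio([1.2] * 10)
print(single, round(combined, 2))
```

Under this assumed model, ten alleles that are each individually unremarkable would together give an odds ratio of about 6, which is the kind of effect the combined-SNP hypothesis envisages.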
8 Use of GEN2PHEN deliverables.
The aim of the previous sections was to establish how clinical scientists use G2P data and how this might be affected by the development of new sequencing technologies. In the following sections, we consider how GEN2PHEN deliverables might be used by clinical scientists. Our intention was to highlight issues that might typically arise when the software is used.
8.1 Mutalyzer
The Mutalyzer software is widely used in diagnostic laboratories and is recommended for use by clinical scientists in the Unclassified Variant guidelines discussed in Section 2. The development of Mutalyzer 2.0 makes it more suitable for the analysis of large numbers of variants. Batch analysis is available, which allows rapid checking of many variants in one submission. When tested, we were able to check more than 2000 variants in approximately one minute. In addition, the development of a web service will enable Mutalyzer to be included in an automated pipeline.
Issues
The following issues arose when testing this software:
Reference sequences not found: Mutalyzer was unable to find some of the reference sequences used by diagnostic laboratories. For example L11353.1, which is used as a
reference sequence for NF2 testing, M28668.1 (CFTR) and AF395588.1 (AFDR) returned a “Gene not found” error message. It is not clear why these sequences cannot be retrieved.
Nucleotide not found error messages. Mutalyzer compares the change submitted by the user with the reference sequence and returns an error if the bases do not match. For example, U43746.1:c.7397T>C gives the error message “T not found at position 7625, found C instead”. As this example shows, the error message appears to refer to a different nucleotide (7625) from the one submitted by the user (7397). This difference arises because the user numbers variants according to their position relative to the A of the start codon, while the Mutalyzer error message numbers from the first position in the reference sequence. It might be useful if this difference could be made clear in the error message.
Checking intronic variants. Intronic variants are described by their position relative to the splice sites in an mRNA sequence. For example, U14680.1:c.212+3A>G occurs 3 bases into the intron which begins after the nucleotide corresponding to position 212 in the mRNA. However, Mutalyzer cannot check an intronic variant using a reference sequence which does not contain intronic sequences, and returns an error message (Error: (Mutalyzer): Intronic position given for a non-genomic reference sequence.).
Use of LRGs
The problem of checking intronic variants could be solved by the use of LRGs to describe variants, since these reference sequences contain both the transcript and genomic sequences.
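The numbering difference and the intronic check described above can be illustrated with a small sketch. It does not use or reproduce Mutalyzer's own code: the regular expression handles simple substitutions only, the function names are hypothetical, and the coding sequence start of 229 for U43746.1 is an assumption inferred from the offset in the error message (7625 - 7397 = 228) rather than taken from the sequence record.

```python
import re

# Matches simple substitutions only, e.g. "U43746.1:c.7397T>C" or
# "U14680.1:c.212+3A>G"; other variant types are out of scope here.
SUBSTITUTION = re.compile(
    r"^(?P<ref>[A-Z]+_?\d+\.\d+):c\."
    r"(?P<pos>\d+)(?P<offset>[+-]\d+)?"
    r"(?P<old>[ACGT])>(?P<new>[ACGT])$"
)


def parse(description: str) -> dict:
    """Split a substitution into reference, position, offset and bases."""
    match = SUBSTITUTION.match(description)
    if match is None:
        raise ValueError(f"unsupported description: {description}")
    fields = match.groupdict()
    fields["pos"] = int(fields["pos"])
    fields["intronic"] = fields["offset"] is not None
    return fields


def coding_to_reference_position(c_pos: int, cds_start: int) -> int:
    """Convert c. numbering (A of the start codon = 1) to sequence numbering.

    Valid only for positive positions within the coding sequence.
    """
    return cds_start + c_pos - 1


def check_reference(description: str, reference_is_genomic: bool) -> str:
    """Flag intronic variants described on a transcript-only reference."""
    fields = parse(description)
    if fields["intronic"] and not reference_is_genomic:
        return "intronic position needs a genomic reference sequence"
    return "ok"


CDS_START_U43746 = 229  # assumption inferred from the report's example
variant = parse("U43746.1:c.7397T>C")
print(coding_to_reference_position(variant["pos"], CDS_START_U43746))
print(check_reference("U14680.1:c.212+3A>G", reference_is_genomic=False))
```

An error message that reported both numbers, for example "c.7397 (position 7625 in the reference sequence)", would remove the apparent mismatch for the user.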
8.2 WAVe (Web Analysis of the Variome).
WAVe is a locus-specific database integration tool. It integrates variants from multiple locus-specific databases and provides links to the original LSDBs. It also allows users to obtain data about the gene locus (Gene Cards, HGNC GeneNames, Entrez Gene Report), publications (QuExT searches by gene and disease (Matos et al., 2010)), disease (OMIM), pharmacogenomics (pharma knowledge base), genome (NCBI Map View and Ensembl), pathways (KEGG and Reactome), protein (SwissProt, TrEMBL, PDB, ExPASy, InterPro), GO terms, DiseaseCard and the GEN2PHEN Knowledge Centre.
The process of integrating variants is achieved by warehousing variant data in a WAVe database. Variants are gathered from LSDBs in two ways. For LSDB instances running recent versions of LOVD (2.0 and later), the variant API is used to import all variants into the WAVe database. For the remaining systems (UMD, MUTbase and other legacy applications) a web crawler is used to find variants described in the HGVS format in the webpage's HTML. During the variant gathering process, the HGVS name is checked in order to identify the type of change associated with each variant (“del”, “ins”, etc.). The addresses of LSDBs are taken from a manually curated list incorporating the LSDB listing available on the GEN2PHEN Knowledge Centre and the LOVD database index.
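The crawler-based approach described above, in which HGVS names are found in a page's HTML and the type of change is read from the name, might look something like the following sketch. This is not WAVe's implementation: the regular expression covers only a few simple coding-level descriptions, and the HTML fragment and function names are hypothetical.

```python
import html
import re

# Deliberately simple pattern for coding-level names such as
# c.68_69delAG, c.5266dupC or c.212+3A>G (illustrative, not exhaustive).
VARIANT_PATTERN = re.compile(
    r"c\.[0-9_+\-*]+(?:[ACGT]>[ACGT]|delins[ACGT]*|del[ACGT]*|ins[ACGT]+|dup[ACGT]*)"
)

# Checked in order, so that "delins" is recognised before "del" and "ins".
CHANGE_TYPES = (
    ("delins", "deletion-insertion"),
    ("del", "deletion"),
    ("ins", "insertion"),
    ("dup", "duplication"),
    (">", "substitution"),
)


def extract_variants(page_html: str) -> list:
    """Return HGVS-like names found in the page, after decoding entities."""
    return VARIANT_PATTERN.findall(html.unescape(page_html))


def change_type(name: str) -> str:
    """Identify the type of change from keywords in the variant name."""
    for keyword, label in CHANGE_TYPES:
        if keyword in name:
            return label
    return "unknown"


# Hypothetical fragment of an LSDB web page.
page = "<tr><td>c.68_69delAG</td><td>c.5266dupC</td><td>c.212+3A&gt;G</td></tr>"
found = extract_variants(page)
for name in found:
    print(name, change_type(name))
```

A pattern-based crawler of this kind can only find variants that are visible in the page and named according to the HGVS nomenclature, which is consistent with the limitations noted in the issues below.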
Issues
For BRCA1 and BRCA2, WAVe provides links to the following LSDBs:
• The Australian variome database (empty)
• Chinese databases with a broken link (domain name does not exist). These may be the old URLs of the Zhejiang University databases.
• LSDBs, the BRCA1/2 Publication databases
• Zhejiang University databases
• UMD databases.
This is not a problem with WAVe but highlights some of the problems for curators of lists of LSDBs.
• There are many LSDBs which are empty, more so because GEN2PHEN has created large numbers of LSDBs that are awaiting curators.
• Links are broken when databases are moved.
• Some of the LSDBs cannot be accessed because they are password protected. For example, of the 616 BRCA1 variants listed in WAVe, none appear to come from the UMD database, presumably because this database is password protected. However, because the UMD database is included in the list of BRCA1 LSDBs, the user is given the impression that variants from this LSDB are included in the integrated list.
• Databases are excluded from the list. For example, the BIC database is not included. This might be because it is password protected or because the variants are not named in accordance with the HGVS nomenclature. However, this is unfortunate because clinical scientists consider BIC classification to be an important line of evidence when evaluating BRCA1/2 variants. Other databases may be excluded simply because they are not widely known about. For example, the BRCA1/2 classification database (see Section 2.1.2.2) does not feature in any of the curated lists.
In the GEN2PHEN 6th General Assembly Meeting there was some discussion on the need to develop a unified list of LSDB resources. The WAVe project highlights the need for such a list as well as the need for methods to ensure that the list can be kept up to date.
Update: During the process of preparing this report a list of LSDBs has been made available at the GEN2PHEN Knowledge Centre (http://www.gen2phen.org/data/lsdbs) and is temporarily mirrored at EBI (http://www.ebi.ac.uk/~pontus/lsdb_list.php.html). The mirror site at the EBI will shortly be replaced by one at the LRG website (http://www.lrg-sequence.org/page.php).
Publication searches. WAVe allows the user to identify publications linked to their gene of interest. Searches are performed using QuExT (http://bioinformatics.ua.pt/quext/) and can be “by Gene” or “by Disease”. For BRCA1, these searches identify 442 and 358 articles respectively.
• The number of publications returned is much lower than the number identified using PubMed (7684 articles).
• QuExT does not appear to allow the user to search for a specific genetic variant, e.g. BRCA1 Arg71Gly.
• QuExT does not return publications in date order. This makes it difficult for a user to identify those that are recently published.
8.3 HGVbaseG2P
HGVbaseG2P is a database that stores summary level findings from genome-wide association studies (GWAS). It currently contains data from 565 such studies.
8.3.1 Searching HGVbaseG2P using a SNP.
HGVbaseG2P was searched with the SNP rs4986852. This is situated within the BRCA1 gene and corresponds to the missense substitution NM_007294.3:c.3119G>A, changing Ser to Asn at position 1040. This variant is listed as “uncertain significance” in BIC, and six of the eight entries in the LOVD publication database also describe it as uncertain (the other two state that it is non-pathogenic). Searching HGVbaseG2P with rs4986852 returns a marker, showing that this SNP has been used as a marker in GWAS.
Figure 13: Result of searching HGVbaseG2P for rs4986852.
The reference sequence coordinate for the SNP is listed as 38497955 on chromosome 17 (see Figure 13). This is different from the coordinate listed in dbSNP (41244429) and lies outside the current coordinates of the BRCA1 gene in the NCBI reference sequence (NC_000017.10), which are Chr17:41196312..41277500. This appears to be because the HGVbaseG2P coordinates are taken from a different assembly of chromosome 17. As shown in Figure 14, the position 38497955 was used in genome build 36.3 but not in 37.1.
Figure 14: Position of rs4986852 SNP in builds of dbSNP.
Selecting “Results” returns a list of 10 studies. Because the –log P threshold is set to ≥ 0, all 10 studies in which this SNP was used as a marker are returned. The studies (shown in Figure 15) are by Strachan et al. (2007).
Figure 15: Part of a list of 10 studies associated with the BRCA1 SNP rs4986852.
The unadjusted P-value for rs4986852 in each of the ten studies is shown in the left column of the table. The most significant P-value is for a study investigating adult body mass index. It is not clear how a user should interpret this result. Clearly BRCA1 variants are associated with breast and ovarian cancer, not body mass index. However, this marker has not been used in any studies of these cancers. Furthermore, the design of GWAS experiments for breast/ovarian cancer would probably exclude individuals with BRCA1 and BRCA2 mutations. This is the case for the breast cancer study by Kibriya et al. (2009) included in HGVbaseG2P.
8.3.2 Searching HGVbaseG2P using a region.
The database was searched using the coordinates for BRCA1 (41196312..41277500). As described above, these will not be the correct coordinates for the genome build used in the current version of HGVbaseG2P, but serve as an example. The search returns 28 markers for this region. By increasing the significance threshold to –log P ≥ 3 (i.e. P-value ≤ 0.001) the number of markers is reduced to 10 (see Figure 16).
Figure 16: Searching HGVbaseG2P for markers with P-values less than 0.001 using region 41196312 to 41277500 of chromosome 17.
The P-values of the 10 SNPs cannot be viewed because of the possibility that individuals could be identified. However, it is possible to deduce that the values must be between 0.001 and 0.0001, since setting the P-value threshold to –log P ≥ 4 does not return any markers.
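The relationship between the –log P threshold and the P-value, and the deduction made above from two successive thresholds, can be written out as a short sketch. The function names are hypothetical and the code is independent of HGVbaseG2P itself.

```python
import math


def p_value_threshold(min_neg_log_p: float) -> float:
    """P-value corresponding to a -log10(P) threshold."""
    return 10 ** (-min_neg_log_p)


def passes(p_value: float, min_neg_log_p: float) -> bool:
    """True if a marker's P-value meets the -log10(P) threshold."""
    return -math.log10(p_value) >= min_neg_log_p


def bracket(passes_at: float, fails_at: float) -> tuple:
    """P-value range implied when markers pass one threshold but not the next."""
    return (p_value_threshold(fails_at), p_value_threshold(passes_at))


# Markers are returned at -log P >= 3 but not at -log P >= 4.
low, high = bracket(passes_at=3, fails_at=4)
print(low, high)
print(passes(0.0005, 3), passes(0.0005, 4))
```

A marker with a hypothetical P-value of 0.0005, for instance, is returned at the first threshold and dropped at the second, which is the pattern observed for the 10 markers in this region.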
8.3.3 Searching HGVbaseG2P using a gene name.
The result of searching HGVbaseG2P with BRCA1 is shown in Figure 17. There are 39 markers associated with this gene, and these markers have been used in 28 studies.
Figure 17: Searching HGVbaseG2P for markers using BRCA1 text search
There is one study with a keyword match to BRCA1. This is a study of individuals with breast cancer who do not have deleterious mutations in BRCA1 or BRCA2 (Kibriya et al., 2009). None of the BRCA1 gene markers have significant P-values in this study. There are only two studies in which BRCA1 markers have a P-value less than 0.01; these studies investigate height in the British population and Parkinson’s disease (see Figure 18).
Figure 18: GWAS studies associated with markers in the BRCA1 gene.
Issues
• The issue of changing coordinates in different genome assemblies may cause problems for users. It might be useful if the genome assembly being used by HGVbaseG2P was shown on the website.
• Clinical scientists will probably be more used to interpreting odds ratios than P-values. It may be useful to provide some additional information to aid interpretation. In addition, it is not obvious whether the P-values indicate that markers are associated with increased or decreased risks.
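The first issue above could be caught automatically if every coordinate carried its genome build. The sketch below uses the coordinates quoted in Section 8.3.1; the data structures and the `locate` function are hypothetical, and it is assumed that the dbSNP coordinate and the NC_000017.10 gene coordinates both refer to build 37.1.

```python
# Coordinates on chromosome 17 as quoted in this report.
BRCA1_REGION = {"build": "37.1", "start": 41196312, "end": 41277500}
RS4986852 = {"36.3": 38497955, "37.1": 41244429}


def locate(position: int, position_build: str, region: dict) -> str:
    """Compare a position with a region only when both use the same build."""
    if position_build != region["build"]:
        return "builds differ: convert coordinates before comparing"
    inside = region["start"] <= position <= region["end"]
    return "inside region" if inside else "outside region"


for build, position in RS4986852.items():
    print(build, position, locate(position, build, BRCA1_REGION))
```

Comparing the build 36.3 coordinate directly with the build 37.1 gene boundaries would wrongly place the SNP outside BRCA1, which is why displaying the assembly alongside each coordinate would help users.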
8.4 GEN2PHEN Knowledge Centre
The knowledge centre (KC) provides a central platform for access to genotype to phenotype data with specialist knowledge. The KC will be used to disseminate information about the GEN2PHEN project, provide access to project deliverables, GEN2PHEN training tools and G2P resource lists.
8.4.1 Locating GEN2PHEN resources in the Knowledge Centre.
GEN2PHEN users can locate resources from the home page of the KC. There is a series of tabs and drop-down menus. Figure 19 shows the user selecting GEN2PHEN deliverables.
Figure 19: Selecting GEN2PHEN resources using the Knowledge Centre
Finding other GEN2PHEN resources is not so straightforward. There does not appear to be any obvious link to the GEN2PHEN training tools or the resource list. Links are given to these in the recently submitted GEN2PHEN paper (Webb et al., 2011) but it would be useful
to have a clear link on the homepage. In addition, the link to the GEN2PHEN resources given in the paper (http://www.gen2phen.org/resources) appears to be broken (404 error). Using the link to the GEN2PHEN training page (http://www.gen2phen.org/training) provided in the paper allows the user to access the training tools which have been provided to date (see Figure 20).
Figure 20: Identifying GEN2PHEN training resources using the Knowledge Centre
Two of the tools, “Adding a custom log to LOVD” and “Submitting mutation data to the OI and EDS database”, are not accessible without registering as a GEN2PHEN user. It is not clear why this need be the case, and it might be better if the same policy was applied to all training materials. Also, requiring users to register and log in might deter them from using these resources.
8.4.2 Using GEN2PHEN data via the Knowledge Centre
Selecting the GEN2PHEN data tab on the homepage allows the user to access data sets that have been generated by the project. So far this is a single set from an analysis of LSDBs produced by Mitropoulou et al (2010) generated as part of deliverable 2.3. LSDBs can be searched for using gene name or database name. This tool could be useful for clinical scientists looking to find all the available LSDBs for a gene. Searching for BRCA1 produces one hit, the BRCA1 publication database (see Figure 21). The BIC database, BRCA1 classification database and UMD BRCA1 database are not returned. The search also supplies information about the last update of the database. However, this is incorrect; the publication database was last updated in November 2010.
Figure 21: Identifying BRCA1 LSDBs using the Knowledge Centre
Using BRCA2 produces two hits (see Figure 22), but these are to the same Fanconi anaemia database. The first LSDB in the list only provides a link to the new version of the Fanconi anaemia database, which has been developed using LOVD.
Figure 22: Identifying BRCA2 LSDBs using the Knowledge Centre
This search also did not return the UMD BRCA2 database, the BIC database or the BRCA2 publication database. The HGVS website also allows users to search for LSDBs (http://www.hgvs.org/dblist/glsdb.html). As Figure 23 shows, searching the HGVS list for BRCA1 and BRCA2 LSDBs returns more hits, but still excludes the UMD BRCA2 database and the BRCA1 classification database, and the Zhejiang University URLs no longer work.
Figure 23: Identifying BRCA1 and BRCA2 LSDBs using the HGVS website.
The problem for both the GEN2PHEN and HGVS LSDB search tools is that they return LSDBs from either a spreadsheet or a database, and in both cases this is incomplete and out of date. Therefore both miss databases, and the information about them (URLs and last updates) can be incorrect. The situation could be improved by merging the HGVS and GEN2PHEN lists, but the problem of returning a search from an out-of-date database will remain.
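Merging the two lists, as suggested above, is mechanically simple; the harder problem of keeping the merged list current remains. The following sketch shows one possible merge, keyed on gene and URL and keeping the most recently updated record. The record layout, URLs, names and dates are all hypothetical and do not describe either list's actual format.

```python
from datetime import date


def normalise(url: str) -> str:
    """Crude key for matching URLs; a real list would need more careful rules."""
    return url.lower().rstrip("/")


def merge_lists(*lists):
    """Merge LSDB records, keeping the latest update for each gene and URL."""
    merged = {}
    for entries in lists:
        for entry in entries:
            key = (entry["gene"], normalise(entry["url"]))
            current = merged.get(key)
            if current is None or entry["last_update"] > current["last_update"]:
                merged[key] = entry
    return list(merged.values())


# Hypothetical records standing in for the two lists.
list_a = [
    {"gene": "BRCA1", "url": "http://lsdb.example.org/BRCA1/",
     "last_update": date(2010, 1, 14)},
]
list_b = [
    {"gene": "BRCA1", "url": "http://lsdb.example.org/brca1",
     "last_update": date(2010, 11, 1)},
    {"gene": "BRCA1", "url": "http://other.example.org/BRCA1/",
     "last_update": date(2009, 12, 15)},
]
merged = merge_lists(list_a, list_b)
print(len(merged), max(e["last_update"] for e in merged))
```

The merge removes duplicates and stale dates where one source is more current, but it cannot detect a database that neither list knows about or a URL that has since stopped working.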
8.5 Obtaining feedback from the clinical science community.
A set of online surveys was developed for GEN2PHEN software. The aim of these surveys was to advertise software to the user community and to obtain feedback that might help guide future development. The surveys were sent to the 115 members of the ENIGMA consortium, who are investigating unclassified BRCA1 and BRCA2 variants, and to the NGRL mailing list, which is sent to more than three hundred individuals working in clinical genetics. The surveys were kept short in the hope that recipients would complete more than one survey.
Unfortunately the response to these surveys was very poor. Despite contacting several hundred individuals who might be expected to use GEN2PHEN software, the largest response we received for any piece of software was eight. This lack of response is similar to that at the 2010 BSHG conference, where a GEN2PHEN stand was set up and handouts for software were made available. It raises questions as to how we should go about publicising GEN2PHEN and obtaining feedback for software for the remainder of the project. For completeness the responses to the surveys are listed below, although clearly there was not a large enough response upon which to base any development decisions.
Question 1: Had you heard of this software prior to this survey?
| Software | Yes | No |
|---|---|---|
| HGVbaseG2P | 2 | 3 |
| Knowledge Centre | 0 | 5 |
| LOVD | 7 | 1 |
| WAVe | 0 | 3 |
Question 2: Did you find this software easy to use?
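| Software | 1 (difficult) | 2 | 3 | 4 | 5 (easy) |
|---|---|---|---|---|---|
| HGVbaseG2P | 0 | 0 | 2 | 2 | 0 |
| KC | 1 | 0 | 0 | 2 | 1 |
| LOVD | 1 | 0 | 1 | 5 | 1 |
| WAVe | 1 | 0 | 0 | 4 | 0 |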
Question 3: Did you find the presentation of results easy to understand?
| Software | 1 (difficult) | 2 | 3 | 4 | 5 (easy) |
|---|---|---|---|---|---|
| HGVbaseG2P | 0 | 1 | 1 | 2 | 0 |
| KC | 1 | 0 | 1 | 1 | 1 |
| LOVD | 1 | 0 | 1 | 6 | 0 |
| WAVe | 1 | 0 | 0 | 1 | 1 |
Question 4: How useful is this software for analysing clinical data?
| Software | 1 (not useful) | 2 | 3 | 4 | 5 (very useful) |
|---|---|---|---|---|---|
| HGVbaseG2P | 0 | 1 | 1 | 2 | 0 |
| KC | 1 | 1 | 1 | 1 | 0 |
| LOVD | 1 | 0 | 1 | 5 | 1 |
| WAVe | 1 | 1 | 0 | 1 | 0 |
Question 5: Any further comments about this software?
Feedback comments were only obtained for LOVD. These were:
• “Too many data in a single row”
• “Question 4 not really relevant to me as don't analyse clinical data” (this was from the individual who gave the software a score of 1).
9 Generating LRGs for BRCA1 and BRCA2
New LRGs are produced by the EBI and NCBI in response to a request from a user or consortium of users who require an LRG for reporting variants. As an example of how this works, the process of requesting LRGs for BRCA1 and BRCA2 has been documented. This includes the analysis of existing reference sequences for BRCA1 and BRCA2 and the identification of differences between them, as well as consideration of the different splicing variants.
9.1 Available BRCA1 and BRCA2 LSDBs and reference sequences.
Several LSDBs exist for both BRCA1 and BRCA2. Table 2 lists these LSDBs and the reference sequences they use to describe variants. The LSDBs were found in the HGVS list (http://www.hgvs.org/dblist/glsdb.html), in the LOVD list of databases and from our knowledge of the ENIGMA group. There may well be other datasets, not yet publicly available, that have used other reference sequences.
| Gene | Location | URL | Reference sequence | Database Version |
|---|---|---|---|---|
| BRCA1 | The UMD BRCA1 mutations database | http://www.umd.be/BRCA1/ | U14680.1 | |
| BRCA1 | Chromium.liacs.nl | http://chromium.liacs.nl/LOVD2/cancer/home.php | NG_005905.1 (Note: this is a RefSeqGene; the transcript id it uses is NM_007294.3) | BRCA1 091003 |
| BRCA1 | Zhejiang University Centre for Genetic and Genomic Medicine | http://www.chinahvp.org/LOVD/home.php?select_db=BRCA1 (this URL no longer works) | NM_007294.2 | BRCA1 100114 |
| BRCA1 | brca.iarc.fr | http://brca.iarc.fr/LOVD/home.php?select_db=BRCA1 | U14680.1 | BRCA1 091215 |
| BRCA1 | BIC | http://research.nhgri.nih.gov/bic/ | U14680.1 | |
| BRCA2 | The UMD BRCA2 mutations database | http://www.umd.be/BRCA2/ | U43746 | |
| BRCA2 | LOVD BRCA2 | http://chromium.liacs.nl/LOVD2/cancer/home.php?select_db=BRCA2 | NG_012772.1 (Note: this is a RefSeqGene; the transcript id it uses is NM_000059.3) | BRCA2 091003 |
| BRCA2 | Zhejiang University Centre for Genetic and Genomic Medicine | http://www.chinahvp.org/LOVD/home.php?select_db=BRCA2 (this URL no longer works) | NM_000059.3 | BRCA2 100115 |
| BRCA2 | Fanconi Anaemia Mutation Database (BRCA2) | http://chromium.liacs.nl/LOVD2/FANC/home.php?select_db=FANCD1 | NM_000059.1 | FANCD1 080908 |
| BRCA2 | BIC | http://research.nhgri.nih.gov/bic/ | NM_000059.1 | |
Table 2: LSDBs and reference sequences for BRCA1 and BRCA2. Sequences used for clinical testing at the Regional Genetics Laboratory Services, Manchester are U14680 (BRCA1) and U43746 (BRCA2).
9.2 Comparison of Reference sequences.
Alignment of the predicted protein sequences provided by the BRCA1 reference sequences shows that there is no difference between them. For BRCA2 there are two differences:
• At position 372 the H in NM_000059.1 and U43746 is replaced by N in NM_000059.3.
• At position 599 the F in NM_000059.1 and U43746 is replaced by S in NM_000059.3.
Comparison of DNA sequences shows that there are several differences between BRCA1 reference sequences:
• According to their annotation, the first exon of NM_007294.2 is exon 1B, while for NM_007294.3 it is exon 1A.
• The U14680 sequence does not extend as far into the 5’ UTR as the other reference sequences and does not have any of the 3’ UTR. The final codon in U14680.1 is the stop codon.
• Alignment of the first exon of U14680.1 to the RefSeqGene NG_005905.1 shows that there are 4 mismatches and 1 gap. This exon does not contain any of the open reading frame.
There are also differences between the BRCA2 reference sequences. Alignment to the RefSeqGene NG_012772.1 indicates:
• Both NM_000059.1 and U43746 contain two exons, 27 (87683-88731) and 28 (89130-89190). In NM_000059.3 these are replaced by a single exon 27 (87683-89193).
• Both NM_000059.1 and U43746 contain 8 mismatches and 1 gap. Two of the mismatches cause the two amino acid differences discussed above. There is a further change within the ORF: at position 4791 there is a G instead of an A. The remaining differences are in the 5’ and 3’ UTR.
9.3
Alternative Splicing
There are no reports of alternate BRCA2 transcripts in either Ensembl or RefSeqGene and no publications were identified in PubMed. However, there is alternative splicing of BRCA1. Table 3 has been taken from a review of BRCA1 alternative splicing by Orban and Olah (2003). The paper lists a number of splice variants which have been identified. As noted by the authors there are problems with identifying alternate transcripts since many publications detail aberrant splicing by pathogenic mutations. Since the publication, several of these sequences have been suppressed by Genbank because “only partial transcript evidence exists for this transcript variant, and its full-length exon combination is unclear”. However, it could be that there are other alternate transcripts missing from Table 3. For example, Ensembl lists a total of thirty transcripts which are predicted to encode proteins.
| Name of the variant | ORF maintained? | Tissues | Comment |
|---|---|---|---|
| Full length BRCA1 | Yes | Breast, ovary, testis, thymus, various other | |
| With exon 1a (NM_007294) | Yes | Breast, ovary, testis, thymus | Still exists, transcript variant 1 |
| With exon 1b (NM_007295) | Yes | Placenta | Suppressed † |
| D(2–10) (NM_007297) | Yes | Breast, lymphocytes | Still exists, transcript variant 3 |
| D(9,10) (NM_007302) | Yes | Breast, ovary, lymphocytes | Suppressed † |
| D(9,10,11q) (NM_007305) | Yes | Breast, ovary, lymphocytes | Suppressed † |
| D(9,10,11) (NM_007298) | Yes | Breast, lymphocytes | Still exists, transcript variant 4 |
| D(11q) (NM_007304) | Yes | Breast, ovary, lymphocytes | Replaced by NM_007298. |
| D(11) (NM_007303) | Yes | Ovary, thyroid | Suppressed † |
| D(14–17) (NM_007299) | Yes | Breast, lymphocytes | Still exists, transcript variant 5 |
| D(14–18) (NM_007300) | Yes | Breast, lymphocytes | Still exists, transcript variant 2 |
| D(15–17) (NM_007301) | No | Breast, lymphocytes | Record removed. |
| -6 nt from 3’ of exon 1a (NM_007296) | Yes | Kidney, lung, other | Record removed. |
Table 3: BRCA1 alternate transcripts. This table has been adapted from Orban and Olah 2003. The comments column, which details whether a GenBank accession still exists, has been added for this report. The original table also featured other transcripts which did not have GenBank accession numbers.
† Transcripts are permanently suppressed because only partial transcript evidence exists for the transcript variant, and its full-length exon combination is unclear (from NCBI site).
9.4
Alignment to genomic reference sequence.
In order to determine the extent to which the reference sequences and alternative transcripts overlap, they were aligned to the RefSeqGene using Spidey (http://www.ncbi.nlm.nih.gov/spidey/). Predicted exon start and end positions of the transcripts are shown in Table 4. Points to note:
• Although NM_007294.2 and NM_007294.3 are supposed to begin with exons 1B and 1A respectively, this is not clear from our alignments. In both, the first exon finishes at the same position (10691). In contrast, in two other transcripts that begin with exon 1B, NM_007297.3 and NM_007299.3, the first exon finishes at 10685.
• There are cases (labelled red, with **) where Spidey appears to have misaligned the exon slightly.
• There are also apparently real variations in some exon boundaries (labelled blue, with *).
• There is an additional exon in NM_007300.3 which does not exist in the other transcripts.
Table 5 shows the alignment of the three BRCA2 LSDB reference sequences to the RefSeqGene NG_012772.1. There is one major difference: NM_000059.3 does not have exon 28. Instead, exon 27 spans the region covered by exons 27 and 28 in the other reference sequences.
9.5
Annotation of transcripts.
Table 6 shows how the exons of the BRCA1 reference sequences and alternative transcripts have been annotated in their respective GenBank files. This has not been done for BRCA2 because two of the three GenBank files do not include annotation of exons. As the table shows, equivalent exons can have different annotation in different transcripts. This may not be important, since HGVS nomenclature does not include exon numbers. However, exon numbers are used by clinical scientists and researchers when discussing sequencing experiments and mutations, and also to catalogue variant data, e.g. in databases. Therefore there is potential for confusion when they are mapping their existing knowledge onto LRGs. Points to note are:
• The exons of U14680.1 are labelled sequentially 1-24 with exon 4 missing. This is because the original exon 4 is now considered to have been a cloning artefact (see for example Brose et al., 2004).
• In other transcripts (other than NM_007294.2) there are exons 4a (in NR_027676.1) or 4b which are equivalent to exon 5 in U14680.1 and NM_007294.2. Possibly the “a” and “b” have been used to avoid confusion with the original exon 4.
• NM_007294.2 does have an exon 4, which is equivalent to exon 3 in other transcripts. The reason for this is that NM_007294.2 does not have an exon annotated as “exon 2”. This reflects the previous annotation of other transcripts. For example, the first exons of transcript NM_007300.2 are labelled 1b, 3, 4, 5.
• Exon 11 in U14680.1 is equivalent to exon 10b in some reference sequences and 11b in NM_007294.2. There is also an exon 10a which is found in alternate transcripts NM_007298.3 and NM_007299.3.
• Exon 14 in U14680.1 is equivalent to exon 14a. There is also an exon 14b which is found in alternate transcripts NM_007298.3, NM_007299.3 and NM_007300.3.
| Exon | NM_007294.2 | NM_007294.3 | NM_007297.3 | NM_007298.3 | NM_007299.3 | NM_007300.3 | NR_027676.1 | U14680.1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 10511-10691 | 10479-10691 | 10511-10685 | | 10511-10685 | 10479-10691 | 10639-10779 | 10593-10691 |
| 2 | 11847-11945 | 11847-11945 | 11847-11945 | 11847-11945 | 11847-11945 | 11847-11945 | 11846**-11945 | 11847-11945 |
| 3 | 20183-20236 | 20183-20236 | | 20183-20236 | 20183-20236 | 20183-20236 | 20183-20236 | 20183-20236 |
| 4 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29506 | 29429-29484* | 29429-29506 |
| 5 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 | 31006-31094 |
| 6 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 | 31701-31840 |
| 7 | 36082-36186 | 36082-36186 | 36082-36186 | 36082-36186 | 36082-36186 | 36082-36186 | 36085-36186 | 36082-36186 |
| 8 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 | 38672-38718 |
| 9 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 | 40040-40116 |
| 10 | 41102-44527 | 41102-44527 | 41102-44527 | 41102-41217* | 41102-41217* | 41102-44527 | 41102-44527 | 41102-44527 |
| 11 | 44930-45018 | 44930-45018 | 44930-45018 | 44929**-45018 | 44929**-45018 | 44930-45018 | 44930-45018 | 44930-45018 |
| 12 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 | 53387-53558 |
| 13 | | | | | | 56563-56628 | | |
| 14 | 59348-59474 | 59348-59474 | 59348-59474 | 59351*-59474 | 59351*-59474 | 59351*-59474 | 59348-59474 | 59348-59474 |
| 15 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 | 61441-61631 |
| 16 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 | 64724-65034 |
| 17 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 | 68267-68354 |
| 18 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 | 72011-72088 |
| 19 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 | 72589-72629 |
| 20 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 | 78827-78910 |
| 21 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 | 84845-84899 |
| 22 | 86768-86841 | 86768-86841 | 86768-86841 | 86768-86841 | | 86768-86841 | 86768-86841 | 86768-86841 |
| 23 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 | 88259-88319 |
| 24 | 90160-91665 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-91667 | 90160-90284 |
Table 4: Exon start and end positions (given as start-end) generated by Spidey alignment of alternative transcripts against RefSeqGene NG_005905.1. Figures in red (with **) indicate small differences in exon positions that might be due to alignment errors. Figures in blue (with *) show real alternate start or end positions for exons.
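The comparison summarised in Table 4 can be sketched as a row-by-row diff of exon coordinates. The Python below is an illustration only, not the procedure used for this report: the function name is hypothetical, and only a handful of rows from Table 4 are included to keep the example short.

```python
def boundary_differences(reference, other):
    """Compare two {exon_row: (start, end)} maps.

    Returns a dict keyed by exon row. The value is 'missing' when the
    exon is absent from `other`, 'extra' when it is absent from
    `reference`, or a (start_shift, end_shift) tuple when both have the
    exon but the boundaries differ. Identical exons are not reported.
    """
    report = {}
    for row in sorted(set(reference) | set(other)):
        if row not in other:
            report[row] = "missing"
        elif row not in reference:
            report[row] = "extra"
        elif reference[row] != other[row]:
            ref_start, ref_end = reference[row]
            oth_start, oth_end = other[row]
            report[row] = (oth_start - ref_start, oth_end - ref_end)
    return report


# A few rows copied from Table 4 (positions on RefSeqGene NG_005905.1).
nm_007294_3 = {1: (10479, 10691), 10: (41102, 44527), 11: (44930, 45018),
               14: (59348, 59474), 22: (86768, 86841)}
nm_007299_3 = {1: (10511, 10685), 10: (41102, 41217), 11: (44929, 45018),
               14: (59351, 59474)}

for row, diff in boundary_differences(nm_007294_3, nm_007299_3).items():
    print(row, diff)
```

A one-base shift, such as the start of row 11, is the kind of difference flagged in the table as a possible alignment error, whereas the large shift at the end of row 10 corresponds to a genuinely different exon boundary.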
| Exon | U43746.1 | NM_000059.1 | NM_000059.3 |
|---|---|---|---|
| 1 | 5000-5188 | 5000-5188 | 5001-5188 |
| 2 | 5943-6048 | 5943-6048 | 5943-6048 |
| 3 | 8598-8846 | 8598-8846 | 8598-8846 |
| 4 | 14597-14705 | 14597-14705 | 14597-14705 |
| 5 | 15622-15671 | 15622-15671 | 15622-15671 |
| 6 | 15763-15803 | 15763-15803 | 15763-15803 |
| 7 | 16020-16134 | 16020-16134 | 16020-16134 |
| 8 | 18964-19013 | 18964-19013 | 18964-19013 |
| 9 | 20440-20551 | 20440-20551 | 20440-20551 |
| 10 | 21793-22908 | 21793-22908 | 21793-22908 |
| 11 | 25786-30717 | 25786-30717 | 25786-30717 |
| 12 | 34079-34174 | 34079-34174 | 34079-34174 |
| 13 | 36348-36417 | 36348-36417 | 36348-36417 |
| 14 | 44382-44809 | 44382-44809 | 44382-44809 |
| 15 | 45949-46130 | 45949-46130 | 45949-46130 |
| 16 | 47263-47450 | 47263-47450 | 47263-47450 |
| 17 | 52044-52214 | 52044-52214 | 52044-52214 |
| 18 | 52700-53053 | 52700-53053 | 52700-53053 |
| 19 | 59922-60078 | 59922-60078 | 59922-60078 |
| 20 | 60477-60621 | 60477-60621 | 60477-60621 |
| 21 | 66191-66312 | 66191-66312 | 66191-66312 |
| 22 | 68838-69036 | 68838-69036 | 68838-69036 |
| 23 | 69271-69434 | 69271-69434 | 69271-69434 |
| 24 | 69528-69666 | 69528-69666 | 69528-69666 |
| 25 | 84210-84454 | 84210-84454 | 84210-84454 |
| 26 | 86419-86565 | 86419-86565 | 86419-86565 |
| 27 | 87683-88731 | 87683-88731 | 87683*-89193* |
| 28 | 89130-89190 | 89130-89190 | |
Table 5: Exon start and end positions (given as start-end) generated by Spidey alignment of the BRCA2 reference sequences against RefSeqGene NG_012772.1. Figures in blue (with *) show alternate start or end positions for exons.
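| Exon number from Spidey alignment | NM_007294.2 | NM_007294.3 | NM_007297.3 | NM_007298.3 | NM_007299.3 | NM_007300.3 | NR_027676.1 | U14680.1 |
|---|---|---|---|---|---|---|---|---|
| 1 | 1b | 1a | 1b | | 1b | 1a | 1c | 1 |
| 2 | 3 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| 3 | 4 | 3 | | 3 | 3 | 3 | 3 | 3 |
| 4 | 5 | 4b | 4b | 4b | 4b | 4b | 4a | 5 |
| 5 | 6 | 5 | 5 | 5 | 5 | 5 | 5 | 6 |
| 6 | 7 | 6 | 6 | 6 | 6 | 6 | 6 | 7 |
| 7 | 8 | 7a | 7a | 7a | 7a | 7a | 7b | 8 |
| 8 | 9 | 8 | 8 | 8 | 8 | 8 | 8 | 9 |
| 9 | 10 | 9 | 9 | 9 | 9 | 9 | 9 | 10 |
| 10 | 11b | 10b | 10b | 10a | 10a | 10b | 10b | 11 |
| 11 | 12 | 11 | 11 | 11 | 11 | 11 | 11 | 12 |
| 12 | 13 | 12 | 12 | 12 | 12 | 12 | 12 | 13 |
| 13 | | | | | | 13 | | |
| 14 | 14a | 14a | 14a | 14b | 14b | 14b | 14a | 14 |
| 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 | 15 |
| 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 | 16 |
| 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 | 17 |
| 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 | 18 |
| 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 | 19 |
| 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 | 20 |
| 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 | 21 |
| 22 | 22 | 22 | 22 | 22 | | 22 | 22 | 22 |
| 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 | 23 |
| 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 | 24 |
Table 6: Alternative strategies for numbering exons in BRCA1 reference sequences.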
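One way to use Table 6 is as a lookup that translates an exon label from one transcript's annotation into another's, going through the shared Spidey alignment row. The sketch below is an assumption about how such a lookup could be organised, not an existing tool. `LABELS` and `translate_exon` are hypothetical names, and only four rows of the table are included.

```python
# Exon labels keyed by Spidey alignment row, for a few rows of Table 6.
LABELS = {
    "U14680.1":    {3: "3", 4: "5",  10: "11",  14: "14"},
    "NM_007294.2": {3: "4", 4: "5",  10: "11b", 14: "14a"},
    "NM_007294.3": {3: "3", 4: "4b", 10: "10b", 14: "14a"},
}


def translate_exon(label, source, target, labels=LABELS):
    """Translate an exon label between two transcripts' annotations.

    Returns None if the label is not known for the source transcript or
    the target transcript has no exon in the same alignment row.
    """
    for row, name in labels[source].items():
        if name == label:
            return labels[target].get(row)
    return None


print(translate_exon("11", "U14680.1", "NM_007294.3"))
print(translate_exon("4", "NM_007294.2", "U14680.1"))
```

Keying every label on the alignment row rather than on another transcript's label avoids having to maintain a separate mapping for each pair of reference sequences.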
9.6
Overview of BRCA1 and BRCA2 reference sequences.
• There are LSDBs for both BRCA1 and BRCA2 that are using out-of-date reference sequences to describe variants. The same applies to sequences used for clinical testing. The use of the out-of-date sequences is part of the SOP used for analysis of these genes.
• Despite the fact that the annotation of NM_007294.2 states that it starts with exon 1B, this does not appear to be the case. Instead, the difference between NM_007294.2 and NM_007294.3 appears to be that NM_007294.3 extends further into the 5’ UTR.
• There are minor differences between the 5’ UTR of U14680.1 and the other BRCA1 reference sequences. Possibly these could be annotated in the updateable section of the LRG.
• There are at least five alternative transcripts for BRCA1. However, these do not appear to be used for describing variants, so there may be no need to include them in the LRG. Nevertheless, it may be useful for researchers to know that variants that they are describing as exonic are in fact intronic in alternative transcripts. For example, if a variant caused a stop codon or frameshift but is in an exon that is not included in all transcripts, this information might be considered significant by clinical scientists.
• However, the process of assessing alternative transcripts is not necessarily straightforward: the positions of exon starts and stops provided in GenBank files may not be correct.
• For BRCA2 there are 399 intronic bases between exons 27 and 28 in U43746.1 and NM_000059.1 which are exonic in NM_000059.3. This does not alter the ORF but means that there could be problems integrating variants in this region.
• Because there are no gaps within the coding regions of the BRCA1 and BRCA2 alignments, the HGVS nomenclature given to the variants within the ORFs should be largely consistent between LSDBs and data from clinical labs.
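The observation that a variant can be exonic in one transcript and intronic in another can be checked mechanically once exon coordinates are available. The following Python sketch assumes inclusive coordinates on the RefSeqGene, as in Table 4. The function name and the query position are hypothetical, and only two exon rows per transcript are listed.

```python
def exon_containing(position, exons):
    """Return the row number of the exon containing `position`, or None.

    `exons` maps row number -> (start, end), with both ends inclusive.
    """
    for row, (start, end) in exons.items():
        if start <= position <= end:
            return row
    return None


# Rows 10 and 11 of Table 4 for two BRCA1 transcripts.
transcripts = {
    "NM_007294.3": {10: (41102, 44527), 11: (44930, 45018)},
    "NM_007298.3": {10: (41102, 41217), 11: (44929, 45018)},
}

position = 42000  # arbitrary position chosen for illustration
for name, exons in transcripts.items():
    row = exon_containing(position, exons)
    status = f"exonic (row {row})" if row is not None else "not in a listed exon"
    print(name, status)
```

Because the example lists only part of each transcript, a result of "not in a listed exon" should be read as intronic only when the full exon table has been supplied.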
9.7
Survey of reference sequence users
Potential users of BRCA1 and BRCA2 LRGs were surveyed to identify which reference sequences they were using and whether they would be willing to use a LRG. The survey was sent out to the same groups as the GEN2PHEN software surveys discussed in section 8.5 of this report. In this case the response was rather better: thirty-four responses were received, despite this survey being rather longer than the software surveys. The questions and responses were as follows:
Questions 1 and 2 concerned the names and organisations of the respondents. As well as replies from UK diagnostic labs, there were responses from Italy, Denmark, Finland, Czech Republic, Netherlands, Greece, Australia, France, USA, Norway and Spain.
Question 3: Which BRCA1 reference sequence do you use?
| Reference sequence | Frequency |
|---|---|
| U14680 | 20 |
| NM_007294.2 | 6 |
| NM_007294.3 | 1 |
| Another sequence | 2 |
| Not answered | 5 |
Question 4: If you answered “Another reference sequence” please enter the id number.
“Previously to 2009 was used U14680, systematic HGVS nomenclature recommendations Ensembl: ENSG00000012048”.
“L78833”
Note, these are both genomic reference sequences.
Question 5: Which BRCA2 reference sequence do you use?
| Reference sequence | Frequency |
|---|---|
| U43746 | 13 |
| NM_000059.1 | 7 |
| NM_000059.3 | 6 |
| Another sequence | 4 |
| Not answered | 4 |
Question 6: If you answered “Another reference sequence” please enter the id number.
“OTTHUMG00000017411 V.2”
“Do not test BRCA2 in lab, but NM_000059.3 in some regions.”
“Previously to 2005 was used U43746.1, from 2005 to 2008 NM_000059.1. systematic HGVS nomenclature recommendations Ensembl: ENSG00000139618”
“AY436640”
Note, as with BRCA1 these are genomic reference sequences.
Question 7: If a LRG reference sequence was created for BRCA1 using NM_007294.3 would you migrate to using this sequence?
| Answer | Frequency |
|---|---|
| Yes | 23 |
| No | 8 |
| Not answered | 3 |
Question 8: If a LRG reference sequence was created for BRCA2 using NM_000059.3 would you migrate to using this sequence?
| Answer | Frequency |
|---|---|
| Yes | 23 |
| No | 9 |
| Not answered | 2 |
Question 9: If you answered no; for questions 7 or 8, please state any reasons you have for not using LRGs.
“I've answered yes but would prefer to say probably. This would depend on there being a consensus among the community that the LRG refseq was the standard to use”
“This would require a significant amount of work for us. We would consider if it became the industry standard.”
“Would do if it became a requirement”
“The reference sequences we use are quoted on every document we use relating to BRCA analysis, and all our patient reports. To change the sequence would be a document control nightmare as everything would need updating to include the new accession numbers! Also any differences between the LRG sequence and the one we use (even intronic and/or irrelevant differences) would have to be changed on all analysis reference files used in the analysis and paper templates in the lab. This would take a huge amount of time and would probably of little or no value to the service. So unless the current sequences we use were shown to be inaccurate we would not change them.”
“I think we should migrate to HGVS numbering”
“I would have answered don't know had that been an option, as we are not currently familiar with them – however the accommodation of alternative amino acid and exon numbering systems could be useful, particularly for genes such as NF1 (exons) and MUTYH (amino acids)”
“My answer for questions 7 and 8 is I don't know. I hadn't heard of LRG until now. It does seem like a good idea to standardise the reference sequences being used by UK diagnostic labs though.”
“Do not test BRCA2 in lab”
“These are the most frequently and first used sequences for BRCA1 and BRCA2 mutations description in the genetic test reports in Italy and world-wide as well as in the international reference database (BIC). Keeping of these reference sequences will ensure consistency in calls of the same mutation amongst different genetic testing Centres.”
“We use the reference sequence used by the genetic testing laboratory, Myriad Genetics, since all of our variants come to us with that nomenclature and we don't need to figure out the other nomenclature for them.”
“Probably yes, but it depends on the choice of the international scientific community”
“We use the most up to date reference sequence. Need further information as to the benefit of having an LRG reference sequence before we alter our current process”.
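The frequency tables for questions 3, 5, 7 and 8 can be turned into proportions with a few lines of arithmetic, which also confirms that each table accounts for all thirty-four responses. This is a small illustrative sketch. The dictionary keys "Q3" to "Q8" and the function name are labels chosen here, not part of the survey.

```python
# Frequencies copied from the survey tables for questions 3, 5, 7 and 8.
SURVEY = {
    "Q3": {"U14680": 20, "NM_007294.2": 6, "NM_007294.3": 1,
           "Another sequence": 2, "Not answered": 5},
    "Q5": {"U43746": 13, "NM_000059.1": 7, "NM_000059.3": 6,
           "Another sequence": 4, "Not answered": 4},
    "Q7": {"Yes": 23, "No": 8, "Not answered": 3},
    "Q8": {"Yes": 23, "No": 9, "Not answered": 2},
}


def percentages(counts):
    """Convert an {answer: frequency} table to percentages (one decimal)."""
    total = sum(counts.values())
    return {answer: round(100 * n / total, 1) for answer, n in counts.items()}


for question, counts in SURVEY.items():
    print(question, "total:", sum(counts.values()), percentages(counts))
```

On these figures, 23 of 34 respondents (roughly two thirds) indicated that they would migrate to an LRG for each gene, while the older sequences U14680 and U43746 remain the most commonly used references.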
9.8
Design of BRCA1 and BRCA2 LRGs.
The decision as to which sequences are to be included in a LRG is made by the LRG requesters. The minimum requirements for a LRG are listed at http://www.lrgsequence.org/page.php?page=contributions. It is intended that the owner of the BRCA1 and BRCA2 LRGs will be Larry Brody at the NHGRI, who is involved in the curation of the BIC database. Currently we are awaiting agreement between Larry Brody and NCBI as to the content of these LRGs.
The LRGs for BRCA1 and BRCA2 will be based upon the RefSeqGene sequences NG_005905.2 (transcript id NM_007294.3) and NG_012772.1 (transcript id NM_000059.3). This has two advantages: firstly, the LRG sequences will reflect our most up-to-date knowledge of the BRCA genes; and secondly, the LRGs and RefSeqGenes will be in agreement. However, it also means that the reference sequences used by other LSDBs will be different from those used for the LRGs. In addition, as the results of the survey show, it will require some effort for diagnostic users to modify SOPs in order to use the LRGs. It is not clear how those who have been using different reference sequences will manage the process of mapping their existing variants onto the LRG sequences.
Although the analysis of BRCA1 alternate transcripts did indicate that there were different exon boundaries as well as alternate exon numberings, the alignment of the sequences used by diagnostic laboratories and by LSDBs shows that they use the same splice product. While LRGs are designed to be able to store information on alternate transcripts, this is intended for use when the alternate transcripts are “necessary and used for reporting mutations and diagnostic purposes” (from minimal information set guidelines). In the case of BRCA1, while alternate transcripts are not used for reporting mutations, they might be useful for diagnostic purposes, and so it is not clear whether or not they should be included in the LRG. There are also alternative numberings of the exons in different reference sequences which could usefully be stored in the LRG.
10
Summary
The aim of the pilot studies is to “objectively track progress and deficiencies in the GEN2PHEN project” and “provide assessment of the usefulness of the system in a ‘real-lifelike scenario’” (quoted text is from the GEN2PHEN Annex I - “Description of Work”). This report describes the ways in which clinical scientists in genetic testing laboratories work with G2P data in order to evaluate variant data. This can be summarised as follows:
• Clinical scientists need to be able to integrate lines of evidence in order to produce an interpretation for a variant.
• There is no standard method for integrating evidence, but the UK Clinical Molecular Genetics Society’s guidelines for the interpretation of unclassified variants detail the evidence that should be considered. In addition, standards are being developed for LSDBs (Greenblatt et al., 2008), the description of variant pathogenicity (Plon et al., 2008) and evidence regarding splice site prediction (Spurdle et al., 2008).
• GEN2PHEN software will provide mechanisms to allow data to be shared and integrated. Will the data standards that are being developed be able to deal with issues of different LSDBs providing different assessments of pathogenicity? Data standards can ensure that LSDBs use the same terms to describe pathogenicity (e.g. the IARC classification terms discussed in Section 3.1). However, they may not be able to influence the quality of pathogenicity assessments, as shown by the variability of the published assessments discussed in Section 4. As well as data standards, there may be a need for data curation standards, to either remove or highlight lower-quality assessments. This may be less of an issue for diagnostic data from genetic testing laboratories, where the adherence to an SOP should ensure the quality of the assessment. However, it is also the case that, since this data is unlikely to be published in a journal, other users will be less able to make their own assessments of data quality.
• Will the development of data standards prevent the same variant identification being integrated more than once?
• Clinical scientists analyse variants according to SOPs, and in order to have GEN2PHEN software adopted by the diagnostic community, the software will have to find a place within the SOPs.
• It was our intention to use feedback from the clinical community about GEN2PHEN software, but this has not been easy to obtain, either by emailed questionnaires or by face-to-face discussions at conferences. We need to consider how best to go about engaging with this community in future, both to get feedback about GEN2PHEN and also to publicise the software and get it adopted.
• The reason for the lack of feedback might be explained by the fact that some of the GEN2PHEN software is still in development and there is as yet no user community. Possibly we need to establish a “critical mass” of users in order to be able to generate feedback.
• However, we should also consider whether there is a mismatch between the needs of the diagnostic/clinical users and what some of the GEN2PHEN tools aim to do. Clinical scientists are clearly not the only users of G2P data, and it may be that other users simply need a list of variants within a gene without supporting evidence for pathogenicity. However, it is also the case that the clinical community generates large amounts of data and that at present most of this data is not made available to the wider community. In order to encourage the sharing of this data it is important that the tools available are attractive to the community and demonstrate clear benefits.
• At present, the integration of G2P data is a manual process performed by clinical scientists on a “one variant at a time” basis. The development of NGS-based clinical genetics will mean that there will need to be far more automation of this process by analysis pipelines. We need to give consideration as to how GEN2PHEN software will function within these pipelines.
10.1 Barriers to data integration and GEN2PHEN solutions
The vision of the GEN2PHEN project is shown in figure 24, reproduced from the GEN2PHEN Description of Work document (Figure 2, page 11).
Figure 24: G2P Databases, Current and Future
As stated in the Description of Work, the problem with G2P databases is that there is “no convenient way to populate the databases, no easy way to exchange or compare or integrate the different resources, and absolutely no way to search the totality of gathered information. It is fragmented, disorganised, and highly inefficient.” GEN2PHEN will develop “a broad array of G2P databases (shown in dotted outlines), all constructed from common principles and standards via open-source software (hence all uniformly coloured white), so enabling widespread interconnectivity in the resulting ‘G2P Knowledge Network’. We will take various measures to bring the existing G2P databases into this network, especially current LSDBs.”
For the BRCA1 and BRCA2 databases discussed in this Pilot Study, there are several barriers to data integration:
• How are LSDB resources identified? As discussed in this report, maintaining an up-to-date LSDB listing is a difficult task.
• Is data suitably available for integration? Several of the BRCA databases are password protected and the curators may not wish to make the data publicly available.
• The HGVS guidelines for naming variants have not been followed by several databases and reformatting data will be necessary.
• How will the use of different reference sequences to describe variants affect integration? For example, do U14680.1:c.211A>G and NM_007294.3:c.211A>G refer to the same variant? At present, in order to be certain that they are the same, the two reference sequences need to be aligned and the region around 211A compared.
• How can data relating to the pathogenicity of variants be combined if (1) LSDBs use unrelated terms to describe pathogenicity (e.g. “1-5”, “+, - or ?”, “yes, no or uncertain”); and (2) standards for describing the probability of a variant being pathogenic are not used?
• How do we avoid data being integrated more than once? For example, there are several LSDBs for the HNPCC gene MLH1. The data from the MMR database has been included in dbSNP. It has also been added to the InSiGHT database. If InSiGHT data were also to be added to dbSNP, how would the re-inclusion of variants already in dbSNP be prevented?
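The reference sequence question raised above (whether the same c. position in two sequences denotes the same variant) can be approached by comparing the sequence context around that position. The Python below is a minimal sketch under strong assumptions: both coding sequences start at the A of the ATG, and there are no insertions or deletions before the position of interest, so that no alignment step is needed. The function name and the toy sequences are hypothetical and are not BRCA1 sequence.

```python
def same_context(cds_a, cds_b, position, flank=5):
    """Check that two coding sequences agree around a c. position.

    `position` is 1-based, counted from the A of the ATG start codon.
    Returns True when the base at `position` and up to `flank` bases on
    either side are identical in both sequences.
    """
    if not 1 <= position <= min(len(cds_a), len(cds_b)):
        raise ValueError("position lies outside one of the sequences")
    start = max(0, position - 1 - flank)
    end = position + flank
    return cds_a[start:end] == cds_b[start:end]


# Toy coding sequences (hypothetical) that differ only at position 18.
seq_a = "ATGGCCAAGTTTGACCTGAAA"
seq_b = "ATGGCCAAGTTTGACCTCAAA"

print(same_context(seq_a, seq_b, 8))   # window covers positions 3 to 13
print(same_context(seq_a, seq_b, 15))  # window covers positions 10 to 20
```

Where the two references contain a gap relative to one another upstream of the variant, the windows would no longer correspond and a true alignment would be needed before the comparison.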
To what extent will GEN2PHEN software help to overcome these barriers?
• As mentioned in Section 8.2, GEN2PHEN is developing a unified list of LSDBs. We need to establish how complete this list is, and how easy it is to keep it up to date. If new LSDB resources are established, how will GEN2PHEN become aware of them?
• The development of LRGs will provide an unchanging reference with which to describe variants. To what extent will LRGs help the problem of integrating sequences identified using alternate reference sequences?
• The VarioML model includes details of a variant’s pathogenicity. The VarioML website (https://svn.gene.le.ac.uk/gen2phen/trunk/data_formats/xml/html/doc.html) mentions an evidence ontology. How will the non-standard descriptions of variant pathogenicity in different databases be modelled using VarioML?
• Is the problem of variants being re-integrated into repositories one that can be solved using data identifiers, of the type being developed to identify researchers?
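To make the pathogenicity vocabulary problem concrete, the sketch below maps the three styles of term mentioned earlier (“1-5”, “+, - or ?”, “yes, no or uncertain”) onto a shared three-level vocabulary. Everything in it is an assumption for illustration: the scheme names, the shared terms, and in particular the grouping of the numeric scale (1-2, 3, 4-5), which presumes that a database's 1-5 scale runs from benign to pathogenic. It is not the VarioML model or any GEN2PHEN implementation.

```python
# Hypothetical mapping from each database's own vocabulary to a shared
# three-level vocabulary. The groupings are assumptions, not a standard.
SCHEMES = {
    "numeric_1_5": {"1": "not pathogenic", "2": "not pathogenic",
                    "3": "uncertain",
                    "4": "pathogenic", "5": "pathogenic"},
    "symbol": {"-": "not pathogenic", "?": "uncertain", "+": "pathogenic"},
    "yes_no": {"no": "not pathogenic", "uncertain": "uncertain",
               "yes": "pathogenic"},
}


def normalise(scheme, value):
    """Map a database-specific pathogenicity value to the shared vocabulary.

    Raises ValueError when the scheme or the value is not recognised,
    so that unmapped terms are never silently merged.
    """
    key = str(value).strip().lower()
    try:
        return SCHEMES[scheme][key]
    except KeyError:
        raise ValueError(f"unknown value {value!r} for scheme {scheme!r}") from None


print(normalise("symbol", "+"))
print(normalise("numeric_1_5", 3))
print(normalise("yes_no", "Uncertain"))
```

Collapsing to three levels discards distinctions such as “likely” versus “definitely” pathogenic, and it says nothing about the quality of the underlying assessment, which is the limitation of data standards noted in the Summary.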
11
References
Aarnio M, Sankila R, Pukkala E, Salovaara R, Aaltonen LA, de la Chapelle A, Peltomäki P, Mecklin JP, Järvinen HJ. 1999. Cancer risk in mutation carriers of DNA-mismatch-repair genes. Int. J. Cancer. 81: 214-8.
Auranen A, Song H, Waterfall C, Dicioccio RA, Kuschel B, Kjaer SK, Hogdall E, Hogdall C, Stratton J, Whittemore AS, Easton DF, Ponder BA, Novik KL, Dunning AM, Gayther S, Pharoah PD. 2005. Polymorphisms in DNA repair genes and epithelial ovarian cancer risk. Int. J. Cancer. 117: 611-8.
Bilgüvar K, Oztürk AK, Louvi A, Kwan KY, Choi M, Tatli B, Yalnizoğlu D, Tüysüz B, Cağlayan AO, Gökben S, Kaymakçalan H, Barak T, Bakircioğlu M, Yasuno K, Ho W, Sanders S, Zhu Y, Yilmaz S, Dinçer A, Johnson MH, Bronen RA, Koçer N, Per H, Mane S, Pamir MN, Yalçinkaya C, Kumandaş S, Topçu M, Ozmen M, Sestan N, Lifton RP, State MW, Günel M. 2010. Whole-exome sequencing identifies recessive WDR62 mutations in severe brain malformations. Nature. 467: 207-10.
Brose MS, Volpe P, Paul K, Stopfer JE, Colligon TA, Calzone KA, Weber BL. 2004. Characterization of Two Novel BRCA1 Germ-Line Mutations Involving Splice Donor Sites. Genetic Testing. 8: 133-138.
Burk-Herrick A, Scally M, Amrine-Madsen H, Stanhope MJ, Springer MS. 2006. Natural selection and mammalian BRCA1 sequences: elucidating functionally important sites relevant to breast cancer susceptibility in humans. Mamm. Genome. 17: 257-70.
Chakravarti A. 1999. Population genetics-making sense out of sequence. Nat. Genet. 21: 56-60.
Chiu RW, Akolekar R, Zheng YW, Leung TY, Sun H, Chan KC, Lun FM, Go AT, Lau ET, To WW, Leung WC, Tang RY, Au-Yeung SK, Lam H, Kung YY, Zhang X, van Vugt JM, Minekawa R, Tang MH, Wang J, Oudejans CB, Lau TK, Nicolaides KH, Lo YM. 2011. Noninvasive prenatal assessment of trisomy 21 by multiplexed maternal plasma DNA sequencing: large scale validity study. BMJ.
Cirulli ET, Goldstein DB. 2010. Uncovering the roles of rare variants in common disease through whole-genome sequencing. Nat. Rev. Genet. 11: 415-25.
Costa FF. 2010. Epigenomics in cancer management. Cancer Manag. Res. 2: 255-65.
Cotton RG, Auerbach AD, Beckmann JS, Blumenfeld OO, Brookes AJ, Brown AF, Carrera P, Cox DW, Gottlieb B, Greenblatt MS, Hilbert P, Lehvaslaiho H, Liang P, Marsh S, Nebert DW, Povey S, Rossetti S, Scriver CR, Summar M, Tolan DR, Verma IC, Vihinen M, den Dunnen JT. 2008. Recommendations for locus-specific databases and their curation. Hum. Mutat. 29: 2-5.
Cox DG, Kraft P, Hankinson SE, Hunter DJ. 2005. Haplotype analysis of common variants in the BRCA1 gene and risk of sporadic breast cancer. Breast Cancer Res. 7: R171-5.
Díez O, Cortés J, Domènech M, Brunet J, Del Río E, Pericay C, Sanz J, Alonso C, Baiget M. 1999. BRCA1 mutation analysis in 83 Spanish breast and breast/ovarian cancer families. Int. J. Cancer. 83: 465-9.
Díez O, Osorio A, Durán M, Martinez-Ferrandis JI, de la Hoya M, Salazar R, Vega A, Campos B, Rodríguez-López R, Velasco E, Chaves J, Díaz-Rubio E, Jesús Cruz J, Torres M, Esteban E, Cervantes A, Alonso C, San Román JM, González-Sarmiento R, Miner C, Carracedo A, Eugenia Armengod M, Caldés T, Benítez J, Baiget M. 2003. Analysis of BRCA1 and BRCA2 genes in Spanish breast/ovarian cancer patients: a high proportion of mutations unique to Spain and evidence of founder effects. Hum. Mutat. 22: 301-12.
FitzGerald MG, Marsh DJ, Wahrer D, Bell D, Caron S, Shannon KE, Ishioka C, Isselbacher KJ, Garber JE, Eng C, Haber DA. 1998. Germline mutations in PTEN are an infrequent cause of genetic predisposition to breast cancer. Oncogene. 17: 727-31.
Fleming MA, Potter JD, Ramirez CJ, Ostrander GK, Ostrander EA. 2003. Understanding missense mutations in the BRCA1 gene: an evolutionary approach. Proc. Natl. Acad. Sci. USA. 100: 1151-6.
Flusberg BA, Webster DR, Lee JH, Travers KJ, Olivares EC, Clark TA, Korlach J, Turner SW. 2010. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods. 7: 461-5.
Goldgar DE, Easton DF, Byrnes GB, Spurdle AB, Iversen ES, Greenblatt MS; IARC Unclassified Genetic Variants Working Group. 2008. Genetic evidence and integration of various data sources for classifying uncertain variants into a single model. Hum. Mutat. 29: 1265-72.
Goldstein DB. 2009. Common genetic variation and human traits. N. Engl. J. Med. 360: 1696-8.
Gonzalez KD, Noltner KA, Buzin CH, Gu D, Wen-Fong CY, Nguyen VQ, Han JH, Lowstuter K, Longmate J, Sommer SS, Weitzel JN. 2009. Beyond Li Fraumeni Syndrome: clinical characteristics of families with p53 germline mutations. J. Clin. Oncol. 27: 1250-6.
Górski B, Narod SA, Lubiński J. 2005. A common missense variant in BRCA2 predisposes to early onset breast cancer. Breast Cancer Res. 7: R1023-R1027.
Greely HT. 2011. Get ready for the flood of fetal gene screening. Nature 469: 289-91.
Greenblatt MS, Brody LC, Foulkes WD, Genuardi M, Hofstra RM, Olivier M, Plon SE, Sijmons RH, Sinilnikova O, Spurdle AB; IARC Unclassified Genetic Variants Working Group. 2008. Locus-specific databases and recommendations to strengthen their contribution to the classification of variants in cancer susceptibility genes. Hum. Mutat. 29: 1273-81.
Greenman J, Mohammed S, Ellis D, Watts S, Scott G, Izatt L, Barnes D, Solomon E, Hodgson S, Mathew C. 1998. Genes Chromosomes Cancer. 21: 244-9.
Guttmacher AE, McGuire AL, Ponder B, Stefánsson K. 2010. Personalized genomic information: preparing for the future of genetic medicine. Nat. Rev. Genet. 11: 161-5.
Hearle N, Schumacher V, Menko FH, Olschwang S, Boardman LA, Gille JJ, Keller JJ, Westerman AM, Scott RJ, Lim W, Trimbath JD, Giardiello FM, Gruber SB, Offerhaus GJ, de Rooij FW, Wilson JH, Hansmann A, Möslein G, Royer-Pokora B, Vogel T, Phillips RK, Spigelman AD, Houlston RS. 2006. Frequency and spectrum of cancers in the Peutz-Jeghers syndrome. Clin. Cancer Res. 12: 3209-15.
Hadjisavvas A, Charalambous E, Adamou A, Neuhausen SL, Christodoulou CG, Kyriacou K. 2004. Hereditary breast and ovarian cancer in Cyprus: identification of a founder BRCA2 mutation. Cancer Genet. Cytogenet. 151: 152-6.
Janezic SA, Ziogas A, Krumroy LM, Krasner M, Plummer SJ, Cohen P, Gildea M, Barker D, Haile R, Casey G, Anton-Culver H. 1999. Germline BRCA1 alterations in a population-based series of ovarian cancer cases. Hum. Mol. Genet. 8: 889-97.
Johnson N, Fletcher O, Palles C, Rudd M, Webb E, Sellick G, dos Santos Silva I, McCormack V, Gibson L, Fraser A, Leonard A, Gilham C, Tavtigian SV, Ashworth A, Houlston R, Peto J. 2007. Counting potentially functional variants in BRCA1, BRCA2 and ATM predicts breast cancer susceptibility. Hum. Mol. Genet. 16: 1051-7.
Kibriya MG, Jasmine F, Argos M, Andrulis IL, John EM, Chang-Claude J, Ahsan H. 2009. A pilot genome-wide association study of early-onset breast cancer. Breast Cancer Res. Treat. 114: 463-77.
Lee TC, Lee AS, Li KB. 2008. Incorporating the amino acid properties to predict the significance of missense mutations. Amino Acids. 35: 615-26.
McKean-Cowdin R, Spencer Feigelson H, Xia LY, Pearce CL, Thomas DC, Stram DO, Henderson BE. 2005. BRCA1 variants in a family study of African-American and Latina women. Hum. Genet. 116: 497-506.
Matos S, Arrais JP, Maia-Rodrigues J, Oliveira JL. 2010. Concept-based query expansion for retrieving gene related publications from MEDLINE. BMC Bioinformatics. 11: 212.
Menzel HJ, Sarmanova J, Soucek P, Berberich R, Grünewald K, Haun M, Kraft HG. 2004. Association of NQO1 polymorphism with spontaneous breast cancer in two independent populations. Br. J. Cancer. 90: 1989-94.
Miki Y, Swensen J, Shattuck-Eidens D, Futreal PA, Harshman K, Tavtigian S, Liu Q, Cochran C, Bennett LM, Ding W, et al. 1994. A strong candidate for the breast and ovarian cancer susceptibility gene BRCA1. Science. 266: 66-71.
Mitropoulou C, Webb AJ, Mitropoulos K, Brookes AJ, Patrinos GP. 2010. Locus-specific database domain and data content analysis: evolution and content maturation toward clinical use. Hum. Mutat. 31: 1109-16.
Morris JR, Pangon L, Boutell C, Katagiri T, Keep NH, Solomon E. 2006. Genetic analysis of BRCA1 ubiquitin ligase activity and its relationship to breast cancer susceptibility. Hum. Mol. Genet. 15: 599-606.
Need AC, Goldstein DB. 2010. Whole genome association studies in complex diseases: where do we stand? Dialogues Clin. Neurosci. 12: 37-46.
Ng SB, Buckingham KJ, Lee C, Bigham AW, Tabor HK, Dent KM, Huff CD, Shannon PT, Jabs EW, Nickerson DA, Shendure J, Bamshad MJ. 2010. Exome sequencing identifies the cause of a mendelian disorder. Nat. Genet. 42: 30-5.
Orban TI, Olah E. 2003. Emerging roles of BRCA1 alternative splicing. Mol. Pathol. 56: 191-7.
Pettigrew C, Wayte N, Lovelock PK, Tavtigian SV, Chenevix-Trench G, Spurdle AB, Brown MA. 2005. Evolutionary conservation analysis increases the colocalization of predicted exonic splicing enhancers in the BRCA1 gene with missense sequence changes and in-frame deletions, but not polymorphisms. Breast Cancer Res. 7: R929-39.
Plon SE, Eccles DM, Easton D, Foulkes WD, Genuardi M, Greenblatt MS, Hogervorst FB, Hoogerbrugge N, Spurdle AB, Tavtigian SV; IARC Unclassified Genetic Variants Working
Group. 2008. Sequence variant classification and reporting: recommendations for improving the interpretation of cancer susceptibility genetic test results. Hum. Mutat. 29: 1282-91.
Rahman N, Seal S, Thompson D, Kelly P, Renwick A, Elliott A, Reid S, Spanova K, Barfoot R, Chagtai T, Jayatilake H, McGuffog L, Hanks S, Evans DG, Eccles D; Breast Cancer Susceptibility Collaboration (UK), Easton DF, Stratton MR. 2007. Nat. Genet. 39: 165-7.
Read A, Donnai D. 2010. New Clinical Genetics, 2nd Edition. Scion Publishing Ltd.
Reich DE, Lander ES. 2001. On the allelic spectrum of human disease. Trends Genet. 17: 502-10.
Ruffner H, Joazeiro CA, Hemmati D, Hunter T, Verma IM. 2001. Cancer-predisposing mutations within the RING domain of BRCA1: loss of ubiquitin protein ligase activity and protection from radiation hypersensitivity. Proc. Natl. Acad. Sci. USA. 98: 5134-9.
Santos C, Peixoto A, Rocha P, Vega A, Soares MJ, Cerveira N, Bizarro S, Pinheiro M, Pereira D, Rodrigues H, Castro F, Henrique R, Teixeira MR. 2009. Haplotype and quantitative transcript analyses of Portuguese breast/ovarian cancer families with the BRCA1 R71G founder mutation of Galician origin. Fam. Cancer. 8: 203-8.
Schoumacher F, Glaus A, Mueller H, Eppenberger U, Bolliger B, Senn HJ. 2001. BRCA1/2 mutations in Swiss patients with familial or early-onset breast and ovarian cancer. Swiss Med. Wkly. 131: 223-6.
Schrader KA, Masciari S, Boyd N, Wiyrick S, Kaurah P, Senz J, Burke W, Lynch HT, Garber JE, Huntsman DG. 2008. Hereditary diffuse gastric cancer: association with lobular breast cancer. Fam. Cancer. 7: 73-82.
Seal S, Thompson D, Renwick A, Elliott A, Kelly P, Barfoot R, Chagtai T, Jayatilake H, Ahmed M, Spanova K, North B, McGuffog L, Evans DG, Eccles D; Breast Cancer Susceptibility Collaboration (UK), Easton DF, Stratton MR, Rahman N. 2006. Truncating mutations in the Fanconi anemia J gene BRIP1 are low-penetrance breast cancer susceptibility alleles. Nat. Genet. 38: 1239-41.
Seymour IJ, Casadei S, Zampiga V, Rosato S, Danesi R, Falcini F, Strada M, Morini N, Naldoni C, Paradiso A, Tommasi S, Schittulli F, Amadori D, Calistri D. 2008. Disease family history and modification of breast cancer risk in common BRCA2 variants. Oncol. Rep. 19: 783-6.
Spurdle AB, Couch FJ, Hogervorst FB, Radice P, Sinilnikova OM; IARC Unclassified Genetic Variants Working Group. 2008. Prediction and assessment of splicing alterations: implications for clinical testing. Hum. Mutat. 29: 1304-13.
Strachan DP, Rudnicka AR, Power C, Shepherd P, Fuller E, Davis A, Gibb I, Kumari M, Rumley A, Macfarlane GJ, Rahi J, Rodgers B, Stansfeld S. 2007. Int. J. Epidemiol. 36: 522-31.
Szabo C, Masiello A, Ryan JF, Brody LC. 2000. The breast cancer information core: database design, structure, and scope. Hum. Mutat. 16: 123-31.
Tavtigian SV, Deffenbaugh AM, Yin L, Judkins T, Scholl T, Samollow PB, de Silva D, Zharkikh A, Thomas A. 2006. Comprehensive statistical study of 452 BRCA1 missense substitutions with classification of eight recurrent substitutions as neutral. J. Med. Genet. 43: 295-305.
Tavtigian SV, Byrnes GB, Goldgar DE, Thomas A. 2008. Classification of rare missense substitutions, using risk surfaces, with genetic- and molecular-epidemiology applications. Hum. Mutat. 29: 1342-54.
Tommasi S, Pilato B, Pinto R, Monaco A, Bruno M, Campana M, Digennaro M, Schittulli F, Lacalamita R, Paradiso A. 2008. Molecular and in silico analysis of BRCA1 and BRCA2 variants. Mutat. Res. 644: 64-70.
Vega A, Campos B, Bressac-De-Paillerets B, Bond PM, Janin N, Douglas FS, Domènech M, Baena M, Pericay C, Alonso C, Carracedo A, Baiget M, Diez O. 2001. The R71G BRCA1 is a founder Spanish mutation and leads to aberrant splicing of the transcript. Hum. Mutat. 17: 520-1.
Vissers LE, de Ligt J, Gilissen C, Janssen I, Steehouwer M, de Vries P, van Lier B, Arts P, Wieskamp N, Del Rosario M, van Bon BW, Hoischen A, de Vries BB, Brunner HG, Veltman JA. 2010. A de novo paradigm for mental retardation. Nat. Genet. 42: 1109-12.
Walsh T, Lee MK, Casadei S, Thornton AM, Stray SM, Pennil C, Nord AS, Mandell JB, Swisher EM, King MC. 2010. Detection of inherited mutations for breast and ovarian cancer using genomic capture and massively parallel sequencing. Proc. Natl. Acad. Sci. USA. 107: 12629-33.
Weedon MN, Lango H, Lindgren CM, Wallace C, Evans DM, Mangino M, Freathy RM, Perry JR, Stevens S, Hall AS, Samani NJ, Shields B, Prokopenko I, Farrall M, Dominiczak A; Diabetes Genetics Initiative; Wellcome Trust Case Control Consortium, Johnson T, Bergmann S, Beckmann JS, Vollenweider P, Waterworth DM, Mooser V, Palmer CN, Morris AD, Ouwehand WH; Cambridge GEM Consortium, Zhao JH, Li S, Loos RJ, Barroso I, Deloukas P, Sandhu MS, Wheeler E, Soranzo N, Inouye M, Wareham NJ, Caulfield M, Munroe PB, Hattersley AT, McCarthy MI, Frayling TM. 2008. Genome-wide association analysis identifies 20 loci that influence adult height. Nat. Genet. 40: 575-83.
Wenham RM, Schildkraut JM, McLean K, Calingaert B, Bentley RC, Marks J, Berchuck A. 2003. Polymorphisms in BRCA1 and BRCA2 and risk of epithelial ovarian cancer. Clin. Cancer Res. 9: 4396-403.
This document is © 2011 by acaciareiche - all rights reserved.
