Researcher Identification Primer

A number of ostensibly separate initiatives, with diverse objectives, have begun considering the risks, benefits, and practicalities of unambiguously identifying researchers as they use and contribute to biomedical data sources on the Internet. The GEN2PHEN project is one such initiative, given its general aim of helping to unify human and model organism genetic variation databases towards increasingly holistic views of Genotype-To-Phenotype (G2P) data. More specifically, the GEN2PHEN project considers researcher identification to be central to how biomedical databasing, and scientific reporting in general, needs to develop.

At the heart of this lies the concept of a user-centric system for researcher identification – in simple terms, one or more ‘ID systems’ by which individuals can be unambiguously identified along with various types of information associated with them, and where the individual controls his/her online identity and how/where it is used. Key examples of research-related activities and services that would benefit from such a system include controlled access to sensitive datasets, author disambiguation in scientific publishing, and credit for scientific contributions, each discussed in the sections that follow.

At present, key Web 2.0 Internet technologies that can underpin such a system (e.g. OpenID, a decentralized, open authentication protocol, described further elsewhere in this primer) are being widely adopted. To advance this field, a community of key stakeholders (e.g. GEN2PHEN, P3G, HUGO, HVP) has been assembled and is continually growing. This group is exploring innovative ways to exploit this new Internet ecosystem to support research-related activities and services.

Please feel free to peruse the information in this primer, presented as a series of mini-essays. We hope it will make you more aware of allied projects in the field and of the latest relevant technologies.

Note: These pages are an expanded version of the original whitepaper, initially put together by members of the GEN2PHEN project and circulated via e-mail in February-March 2009. We hope the ideas and issues raised stimulate debate, and we welcome contributions and corrections. Feel free to contact us, to post comments on these pages, or to join our discussion forums.

Sensitive datasets, data privacy, and access control

Investigations into clinical materials, especially high-throughput experiments and genetic epidemiology studies involving thousands of individuals, generate data from which study participants can be identified. In order to protect these individuals from potential misuse of the data generated about them (e.g. discrimination by health insurance providers or potential employers), the dissemination of these data must be carefully controlled and involves many stakeholders (see e.g. ref. [fn]Foster et al. Share and share alike: deciding how to distribute the scientific and social benefits of genomic data. Nature Reviews Genetics (2007) vol. 8 (8) doi:10.1038/nrg2360[/fn]). But this will become increasingly costly and difficult to manage on a case-by-case basis, given increases in: the number of such studies; the number of groups/consortia generating such datasets; the number of databases wishing to integrate and disseminate the information; and the number of researchers wishing to access these data.

Case study: individual-level data from genome-wide association studies

Currently, to gain access to genotype data from genome-wide association studies (GWAS) from the Wellcome Trust Case-Control Consortium (WTCCC), one must complete a special form, wait up to 2 months for approval from the relevant Data Access Committee, and sign a Data Access Agreement. The researcher is then allowed to download encrypted files from the European Genotype Archive (EGA) website to his computer, and must decrypt these files with a provided key. NCBI’s database of Genotypes and Phenotypes (dbGaP) has similar procedures in place.

While there are good reasons for these measures, they already impede the rate of research progress, and will increasingly do so as opportunities for broad dataset integration and meta-analysis are ever more curtailed by limitations on access. Simply extending the current system will not change the core fact that access permissions must be applied for per dataset/project, making it very onerous for researchers who need to access many datasets from multiple sources. Also, as the researcher must download the data to his local computer, the system does not scale up to future applications where data integration will take place on-the-fly across many diverse data sources on the Internet. Therefore, even though the primary data in question are in principle available to researchers, the potential for data reuse (e.g. data mining, secondary analyses) is greatly diminished by current dissemination practices.

Case study: aggregate data from genome-wide association studies

Aggregate representations of individual genotype information (genotype/allele frequencies, aka genotype summaries) from GWASs were until recently distributed without restrictions, based on the assumption that this level of detail does not enable re-identification of individuals within the group. This enabled secondary data providers, or portals (such as HGVbaseG2P), to collect these data and present them to end users via special-purpose genome views and search modalities, thus adding value to the original results.

However, in a recent paper[fn]Homer et al. Resolving individuals contributing trace amounts of DNA to highly complex mixtures using high-density SNP genotyping microarrays. PLoS Genet (2008) vol. 4 (8) doi:10.1371/journal.pgen.1000167[/fn] the authors show that given a high-density genetic profile for an individual, it is possible to work out whether the individual participated in a genetic association study, even if only aggregate genotype data are available from the study. As a result of these findings, various data providers and funders have effectively halted unrestricted sharing of aggregate data (see e.g. response from NIH[fn]Zerhouni et al. Protecting aggregate genomic data. Science (2008) vol. 322 (5898) doi:10.1126/science.1165490[/fn]) and now full individual-level data access privileges are required to get to the aggregate data. As a consequence of this, secondary data providers now cannot present or re-distribute aggregate GWAS data without greatly restricting the amount of information shown at one time, severely limiting the value such projects could otherwise add.

A registry for users of biomedical data

The whole process would obviously be greatly streamlined if one or more services (probably operated by major regional data centres such as WTSI and NCBI) were to store information on access privileges for each researcher, based on an OpenID that he would provide upon registration. The registry (or registries) could then be used by various primary and secondary data providers (whether or not operated by WTSI or NCBI) to check whether or not a person should be allowed access to a given type of sensitive dataset. The same registry could also be used to ‘blacklist’ individuals found guilty of inappropriate use of data (though the complex issue of sanctions needs much further consideration, whatever mechanism for access approval is in operation).
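As a purely illustrative sketch (here in Python), the core of such a registry could be little more than a permit table keyed by OpenID; the class and method names below are invented for this example and do not correspond to any existing WTSI or NCBI service:

```python
# Hypothetical researcher-access registry keyed by OpenID.
# All names are illustrative; no real WTSI/NCBI API is implied.

class ResearcherRegistry:
    def __init__(self):
        self._permits = {}       # openid -> set of permitted dataset IDs
        self._blacklist = set()  # openids barred for inappropriate data use

    def grant(self, openid, dataset_id):
        """Record that a Data Access Committee has approved this researcher."""
        self._permits.setdefault(openid, set()).add(dataset_id)

    def blacklist(self, openid):
        """'Blacklist' an individual found guilty of misusing data."""
        self._blacklist.add(openid)
        self._permits.pop(openid, None)

    def may_access(self, openid, dataset_id):
        """Called by primary/secondary data providers before serving data."""
        if openid in self._blacklist:
            return False
        return dataset_id in self._permits.get(openid, set())


registry = ResearcherRegistry()
registry.grant("https://alice.example.org/openid", "WTCCC-bipolar")
print(registry.may_access("https://alice.example.org/openid", "WTCCC-bipolar"))  # True
```

In a real deployment the `may_access` check would of course be a secure web-service call rather than a local method, but the essential lookup is this simple.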

Granularity of data access permissions

The registry and participating data providers could have different levels of granularity for access permissions. For example, in the simplest scenario a person who is listed in the registry (thereby confirming his status as a researcher) could be given 'blanket' access to quasi-sensitive data (such as aggregate genotypes, as outlined in the case study above). System(s) enabling this could be developed relatively quickly, and thus this strategy may serve as an interim solution to the acute aggregate data sharing problem.

In a more complex scenario involving individual-level data, a researcher could be granted access to all datasets from a particular archive (e.g. dbGaP), or all data from a particular consortium which has submitted several datasets to one or more archives (e.g. all WTCCC data). Finally, a researcher could be given access to only a particular dataset (e.g. WTCCC bipolar study).
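A minimal sketch of how these levels of granularity might be checked, from coarsest to finest; the level names, OpenID, and dataset identifiers are assumptions made purely for illustration:

```python
# Hypothetical permits at different granularities for one researcher.
PERMITS = {
    "https://alice.example.org/openid": [
        ("blanket", "aggregate"),      # any quasi-sensitive aggregate data
        ("archive", "dbGaP"),          # every dataset in one archive
        ("consortium", "WTCCC"),       # every dataset from one consortium
        ("dataset", "WTCCC-bipolar"),  # one specific dataset
    ],
}

def is_permitted(openid, sensitivity, archive, consortium, dataset):
    """Match a data request against permits, coarsest granularity first."""
    for level, scope in PERMITS.get(openid, []):
        if (level, scope) in {("blanket", sensitivity),
                              ("archive", archive),
                              ("consortium", consortium),
                              ("dataset", dataset)}:
            return True
    return False

# Individual-level request, granted here via the WTCCC consortium-level permit:
print(is_permitted("https://alice.example.org/openid",
                   "individual", "EGA", "WTCCC", "WTCCC-bipolar"))  # True
```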

Conclusions

Current practices for disseminating sensitive biomedical data are onerous and will not scale to support future large-scale data integration tasks. An online registry or registries of researchers and their data access privileges will be a key component in streamlining this process, both for acquiring data access permits initially and for accessing the data from primary as well as secondary data providers. Such a framework will require researchers to prove that they are who they say they are, in a robust way, across multiple websites, which in turn requires the adoption of a universal authentication system (as described elsewhere in this primer).

Author names and authorship in scientific publications

Ambiguity in author names for scholarly publications has long been a problem in science. Several commercial and non-commercial efforts have been launched to counter this problem, in which authors are assigned unique identifiers and ambiguities are resolved, in part by enlisting the help of the authors themselves. However, the value of these services will be limited unless they can be linked with one another and with other services via a common authentication system.

Name confusion: who published what?

Multiple authors can have the same name, and authors sometimes change their name (e.g. women marrying and taking their husband’s family name). This can result in inaccurate literature searches, the wrong person being asked to peer-review a paper, and a host of other problems. The issue is particularly pronounced for non-English authors from countries such as India or China, where a large number of individuals share the same family name, a situation made worse when different names end up being spelt in different ways when converted into English[fn]Jane Qiu. Scientific publishing: Identity crisis. Nature News (2008) vol. 451 (7180) doi:10.1038/451766a[/fn].

Existing initiatives

Unique author identifiers have been suggested to resolve this problem[fn]Falagas. Unique Author Identification Number in Scientific Databases: A Suggestion. PLoS Med (2006) vol. 3 (5) doi:10.1371/journal.pmed.0030249[/fn], and two commercial services by major publishers, ResearcherID by Thomson-Reuters (see also comment in The Lancet[fn]Cals and Kotz. Researcher identification: the right needle in the haystack. Lancet (2008) vol. 371 (9631) doi:10.1016/S0140-6736(08)60931-9[/fn]) and Scopus Author Identifier by Elsevier, are attempts at doing just that. The non-profit CrossRef organization, which runs the DOI cross-publisher citation linking system, is also working on a system for contributor identifiers, provisionally named CrossReg (G. Bilder, personal communication). Whether run by a single organization or by multiple organizations/companies, an open contributor identifier service, or multiple linked services (hereafter referred to simply as CrossReg, for convenience), would be valuable on many levels in scientific publishing, just as DOIs have been for the publications themselves.

Contributor identifier service + OpenID

So how might authors benefit from such a system? Apart from the obvious advantage of author name disambiguation, one important answer concerns centralized author profile management: given that a researcher has registered with CrossReg to claim his profile (and in the process supplied proof that he is who he claims to be), he could then associate his OpenID (see previous section) with his contributor ID. This would enable a host of new possibilities, such as logging on to a publisher’s website via OpenID (e.g. in order to submit a manuscript) and allowing the publisher to securely retrieve the author’s current affiliation and other profile information from the CrossReg service. More on this in the final section of this primer.
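As a hypothetical sketch of this association, consider a lookup table connecting OpenIDs to contributor IDs; the 'creg:' identifier scheme and the function names are invented for illustration and are not part of any announced CrossReg API:

```python
# Invented contributor-ID records ("CrossReg" is the provisional name used
# in the text; this data model and API are assumptions for illustration).
CROSSREG_PROFILES = {
    "creg:0000-1234": {
        "name": "A. Researcher",
        "affiliation": "Example University",
        "publications": ["doi:10.1000/xyz123"],
    },
}

# Link created when the researcher claims a profile and registers an OpenID.
OPENID_TO_CONTRIBUTOR = {
    "https://alice.example.org/openid": "creg:0000-1234",
}

def profile_for_openid(openid):
    """What a publisher's manuscript system might call after OpenID login,
    to pre-fill the author's verified affiliation and publication list."""
    contributor_id = OPENID_TO_CONTRIBUTOR.get(openid)
    if contributor_id is None:
        return None  # the author has not linked an OpenID to a contributor ID
    return CROSSREG_PROFILES[contributor_id]

print(profile_for_openid("https://alice.example.org/openid"))
```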

Incentives/rewards for scientific contributions

The traditional way to gauge a researcher’s scientific prowess is to look at his publication record in peer-reviewed journals, and to use crude, imperfect metrics like the ISI Impact Factor (IF) as a measure of the quality of these journals. But there are many other ways, besides authoring traditional papers, in which researchers contribute to science. Submissions to biological databases, curation of data in those databases, and Web 2.0 activities like scientific blogging and online commenting on and rating of scientific papers (pioneered by PLoS) are examples of activities for which researchers get little or no credit at present. Here we will explore how a microattribution scheme providing incentives/rewards could help in this regard, and outline some examples.

Microcredit/microattribution

If the various contributions (such as those listed above) can be tracked and linked to the identity of each researcher via his OpenID (see previous section), what would then gradually emerge is a web of publication-credit-like (aka ‘microattribution’) information which can be mined and aggregated to produce far more useful metrics of individual scientific contribution than is possible today (see e.g. the Scholar Factor, as proposed in a recent paper in PLoS[fn]Bourne et al. I am not a scientist, I am a number. PLoS Comput Biol (2008) vol. 4 (12) doi:10.1371/journal.pcbi.1000247[/fn]). These ideas are being further developed in the guise of a BioMedical Resource Impact Factor (BRIF)[fn]Cambon-Thomsen. Assessing the impact of biobanks. Nature Genetics (2003) vol. 34 (1) doi:10.1038/ng0503-25b[/fn], which is heavily centered on the needs and activities of the biobanking community.
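As a toy sketch of that aggregation step, assume each service emits (OpenID, contribution-type) events; the event types and weights below are invented, and the proper weighting behind any Scholar Factor-like metric remains an open question:

```python
from collections import Counter

# Invented contribution weights; choosing these well is the hard,
# community-level problem, not the aggregation itself.
WEIGHTS = {"paper": 10, "db_submission": 3, "curation_edit": 1}

def contribution_scores(events):
    """Fold microattribution events, gathered from many services but all
    keyed to researchers' OpenIDs, into one crude per-person metric."""
    scores = Counter()
    for openid, kind in events:
        scores[openid] += WEIGHTS.get(kind, 0)
    return scores

events = [
    ("https://alice.example.org/openid", "paper"),
    ("https://alice.example.org/openid", "db_submission"),
    ("https://alice.example.org/openid", "curation_edit"),
]
print(contribution_scores(events))  # total of 14 for this researcher
```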

Submissions to biological databases

Database submissions are often driven by journal and/or funder requirements; i.e. in order to get a manuscript accepted (or research funded), the authors must submit their primary data to the appropriate archives (e.g. DNA sequences to GenBank/EMBL/DDBJ) and then cite submission accession numbers in their manuscript. For some categories of data, mainly DNA sequences and gene expression microarray experiments, this arrangement is now well established and the data are relatively standardized.

But for other kinds of data which have emerged quite recently (e.g. results from genome-wide association studies, see previous section) this is not the case, and papers are published every month for which the underlying data are not made available. There can be several reasons for this, but key factors are undoubtedly that i) journals/funders do not yet require the data to be submitted to archives, and ii) researchers are not rewarded for submitting data.

Construction/maintenance of databases

For many kinds of biological databases, maintenance involves manual curation of the data contents. Biocurators verify data correctness (e.g. automated gene structure predictions), enhance data by adding related information (e.g. from literature mining) or cross-reference with other databases, and so on. Such work goes largely unnoticed as the output cannot usually be measured in journal publications, prompting calls for a robust infrastructure for recognition of curation work. This will be particularly important to facilitate community curation of large-scale datasets which are too large to be tackled by curator teams[fn]Howe et al. Big data: The future of biocuration. Nature (2008) vol. 455 (7209) doi:10.1038/455047a[/fn].

Additionally, funding for construction, and particularly long-term maintenance, of biological databases is hard to secure[fn]Merali et al. Databases in peril. Nature (2005) vol. 435 (7045) doi:10.1038/nrg2483[/fn]. It would be valuable for database maintainers if they could unequivocally show reviewers how useful their data content is to the community, by way of accurate citation metrics for datasets (possibly right down to the level of individual database records). This would be a far more useful measure of the scientific value of a database than simple website traffic statistics and/or citations to the paper describing the database as a whole.
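As a sketch of what such a record-level citation metric could look like, assuming citation events have already been harvested from the literature (the paper DOIs and record identifiers below are invented):

```python
from collections import Counter

# Hypothetical harvested events: (citing paper, cited database record).
CITATIONS = [
    ("doi:10.1000/paper1", "HGVbaseG2P:record42"),
    ("doi:10.1000/paper2", "HGVbaseG2P:record42"),
    ("doi:10.1000/paper2", "HGVbaseG2P:record99"),
]

def citations_per_record(citations):
    """Tally citations for each individual database record -- a far finer
    usage signal than raw website traffic or citations to the database paper."""
    return Counter(record for _, record in citations)

print(citations_per_record(CITATIONS).most_common())
# [('HGVbaseG2P:record42', 2), ('HGVbaseG2P:record99', 1)]
```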

Conclusions

Constructing a microcredit-tracking system or systems for incentives/rewards is entirely feasible technically, but success is heavily dependent on researchers being able to identify themselves uniquely (and securely) to the various online services involved. Furthermore, a key aspect of this is that the user should always have the choice whether or not to use his primary public identity for these activities (i.e. in situations where anonymity is preferred). The last section in this primer will discuss some scenarios where a user-centric system (as previously presented) can underpin a loosely-connected network of services which can go a long way towards achieving this goal.

Tying it all together

In previous sections of this primer we introduced several aspects of the identification problem and outlined scenarios where a universal authentication system forms an integral part of the solution (see Figure 1 below). Now it is time to investigate how this may work in practice, which technologies (in addition to OpenID) might play a role, and how a researcher might leverage these tools to aggregate information about himself in a meaningful way.

Case study: data access control 

Imagine a researcher working for company X who has applied (and been approved) for access to a collection of type 1 diabetes GWAS datasets in several online archives. The company does not want competitors to know that it is working on this particular disease (NB the same would be true for many academic researchers as well). In general, we can conclude that information on a researcher's data access permits constitutes private information (akin to one's address book) that should not be shared without the user's approval.

Relating this to the access mechanism proposed in a previous section, we can identify a key set of requirements: A) the user needs to log onto one website (the data provider), and B) that service needs to be able to securely request information from another site (the researcher/data-permit service) on the user's behalf (see Figure 1b). The good news is that these requirements can be fulfilled with existing technologies.

Authentication vs authorization

We have already introduced OpenID as a candidate for solving the authentication part of this equation (proving who you are). Put another way, OpenID works sort of like a set of master keys that will open the doors to your house, your car etc. You would never give these keys to just any person on the street and trust them to not do anything inappropriate (like going to your house and stealing your television!). For the same reason, you should never give out your OpenID credentials to anyone.

The other part of the equation is authorization: in the scenario above we need some way for the authenticated user to explicitly permit the data provider to use his OpenID credentials to connect to the researcher registry and retrieve a particular piece of the user's private information. It is possible to combine OpenID (or some other authentication protocol) with its companion open network protocol, OAuth, in order to control permissions for web services in a fine-grained manner. OAuth is often likened to the special valet key supplied with some luxury sports cars, with which a parking attendant can only drive the car a short distance around the parking lot. Another analogy is giving somebody the keys to your house, with the restriction that they can only enter on a Saturday afternoon, and only to watch the football game on your television.
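To make the 'valet key' idea concrete, here is a conceptual sketch of scope-limited, expiring access tokens in Python; it illustrates the delegation principle behind OAuth only, and is not the actual OAuth wire protocol:

```python
import secrets
import time

class AuthorizationServer:
    """Issues narrow 'valet keys' (tokens) instead of sharing credentials."""

    def __init__(self):
        self._tokens = {}  # token -> (openid, scope, expiry timestamp)

    def issue_token(self, openid, scope, ttl_seconds=3600):
        """Called only after the user explicitly approves the request."""
        token = secrets.token_hex(16)
        self._tokens[token] = (openid, scope, time.time() + ttl_seconds)
        return token

    def check(self, token, required_scope):
        """Called by the protected service (e.g. the researcher registry)."""
        record = self._tokens.get(token)
        if record is None:
            return None
        openid, scope, expires_at = record
        if time.time() > expires_at or scope != required_scope:
            return None
        return openid  # valid for exactly this scope, nothing more

auth = AuthorizationServer()
token = auth.issue_token("https://alice.example.org/openid", "read:data-permits")
print(auth.check(token, "read:data-permits"))   # the user's identity
print(auth.check(token, "read:address-book"))   # None: outside the granted scope
```

The data provider never sees the user's OpenID password; it only ever holds a token that unlocks one narrowly-defined resource, for a limited time.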

If you are interested in seeing how this works hands-on, simple online demos exist which, for example, retrieve the contact list from your Google account. A real-life example (albeit using proprietary technology, rather than OpenID+OAuth) is provided by the Facebook social networking website: if you are a Facebook user, chances are you are already using authentication/authorization technologies to share your private data with certain Facebook applications, or to connect your Facebook account with external services such as the Flickr photo-sharing site.

Case study: microcredit tracking

When it comes to tracking database submissions, curation, and other scientific contributions, one can imagine a researcher initially associating his OpenID with a microcredit tracker service. Whenever the researcher submits data to a biological repository (logged in via his OpenID), the repository submission service contacts the tracker service (securely, via OAuth) and transmits an indicator of the contribution (an authenticated token-based mechanism has been suggested[fn]Bourne et al. I am not a scientist, I am a number. PLoS Comput Biol (2008) vol. 4 (12) doi:10.1371/journal.pcbi.1000247[/fn]). The same could be done for data curation efforts, wiki editing, and many other kinds of contributions; the tracker mechanism could be made completely generic.
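A minimal sketch of what such an authenticated contribution token might look like, here using an HMAC over a secret shared between repository and tracker; the token format and secret-sharing scheme are our assumptions, not the specific mechanism proposed by Bourne et al.:

```python
import hashlib
import hmac
import json

SHARED_SECRET = b"repository-tracker-secret"  # established out of band

def make_token(openid, accession):
    """Repository side: sign a record of the contribution upon submission."""
    payload = json.dumps({"openid": openid, "accession": accession},
                         sort_keys=True).encode()
    signature = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return payload, signature

def verify_token(payload, signature):
    """Tracker side: record the credit only if the signature checks out."""
    expected = hmac.new(SHARED_SECRET, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, signature)

payload, sig = make_token("https://alice.example.org/openid", "ACC:12345")
print(verify_token(payload, sig))  # True: the tracker accepts the credit
```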

Over time, the tracker (or possibly multiple trackers) would therefore aggregate submission credit information for the researcher, and the researcher may choose to make this aggregated information public (though some may not want to).

Aggregating information to populate a professional profile

Suppose a future researcher has, through his online activities, accumulated various kinds of information at many different locations, all connected via his online identity. How can he aggregate this information and put it to use? One realistic objective is to create a professional profile (e.g. for job applications) which, among other things, would list scholarly publications and other contributions. Many professionals consider it important to maintain such a profile online, often via dedicated websites such as LinkedIn, or on Facebook and similar social networking websites.

One can easily imagine an extension to LinkedIn which lets a user configure his profile to include a list of published papers retrieved from, and verified by, a central CrossReg service (see previous section), as well as a summary of verified database submissions fetched from a microcredit tracker service.
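A sketch of that aggregation, with each stub function standing in for an authorized (OpenID+OAuth) call to a separate service; the endpoints and return values are entirely invented:

```python
def fetch_publications(openid):
    """Stand-in for a verified publication list from a CrossReg-like service."""
    return ["doi:10.1000/xyz123"]

def fetch_submission_summary(openid):
    """Stand-in for verified totals from a microcredit tracker service."""
    return {"db_submissions": 7, "curation_edits": 42}

def build_profile(openid):
    """Aggregate verified information from several loosely-coupled services
    into one professional profile the researcher may choose to publish."""
    return {
        "openid": openid,
        "publications": fetch_publications(openid),
        "contributions": fetch_submission_summary(openid),
    }

print(build_profile("https://alice.example.org/openid"))
```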

Conclusions

Any system which enables detailed tracking of individuals’ activities, whether online or in the real world, brings with it the potential for invasion of privacy by governmental agencies and other parties. These ‘Big Brother’ concerns are valid and need to be addressed. But researchers cannot expect to have their cake (anonymity) and eat it too (accurate publication record, microattribution etc.). As pointed out in a recent report[fn]Wolinsky. What's in a name?. EMBO Rep (2008) vol. 9 (12) doi:10.1038/embor.2008.217[/fn] there is “a careful balance to be struck between giving credit where credit is due and knowing everything about everyone”.

Nevertheless, a system such as outlined above, where the individual is in the driving seat and controls his online identity and how/where it is used, would go a long way towards addressing these privacy concerns and will be an important aspect of how science is conducted in the future.