Researcher Identification Primer

A number of ostensibly separate initiatives, with diverse objectives, have begun considering the risks, benefits, and practicalities of unambiguously identifying researchers as they use and contribute to biomedical data sources on the Internet. The GEN2PHEN project is one such initiative, given its general aim of helping to unify human and model organism genetic variation databases towards increasingly holistic views into Genotype-To-Phenotype (G2P) data. More specifically, the GEN2PHEN project considers researcher identification to be an absolutely central part of how biomedical databasing, and scientific reporting in general, needs to be developed.

At the heart of this lies the concept of a user-centric system for researcher identification – in simple terms, one or more ‘ID systems’ by which individuals can be unambiguously identified along with various types of information associated with them, and where the individual controls his/her online identity and how/where it is used. Key examples of research-related activities and services that would benefit from such a system, each discussed in later sections of this primer, include controlled access to sensitive datasets, author identification in scientific publications, and incentives/rewards for scientific contributions.

At present, key Web 2.0 Internet technologies which can underpin such a system (e.g., OpenID - a decentralized, open authentication protocol, further described on this page), are being widely adopted. To advance this field, a community of key stakeholders (e.g., GEN2PHEN, P3G, HUGO, HVP) has been assembled and is continually growing. This group is exploring innovative ways to exploit this new Internet ecosystem to support research-related activities and services.

Please feel free to peruse the information in this primer, presented as a series of mini-essays listed on the menu to the left. Hopefully this information will help you become more aware of allied projects in the field, and the latest relevant technologies.

 


Note: These pages are an expanded version of the original whitepaper, initially put together by members of the GEN2PHEN project and circulated via E-mail in February-March 2009. We hope the ideas and issues raised stimulate debate and we would welcome contributions/corrections. Feel free to contact us directly, or participate in the discussion by posting comments on these pages, or joining our discussion forums.

OpenID as common authentication system

This part of our web series discusses the merits of a user-centric system for individual identification on the Internet and the role OpenID and related network protocols have to play in this regard.

 

User-centric, bottom-up vs top-down

One strategy for researcher identification is a top-down approach, whereby each researcher is unilaterally assigned an identifier which is subsequently used wherever information relating to the researcher needs to be tracked or linked. Arguably, however, pushing a ‘single-identifier-everywhere’ solution on to researchers would be difficult to set up and operate, and would meet with considerable pushback due to concerns about liberty and free will (see more below). A more attractive alternative is a user-centric pull system, in which each individual seeks out the ID(s) they wish to utilize, and establishes their own links to other information as and when needed.

Lessons from social networking

This ‘pull’ situation is highly analogous to recent developments in the online social community. At popular networking websites such as Facebook and Flickr, various Web 2.0 services (such as personal blogging platforms) are increasingly being linked together seamlessly to enhance the user experience. A key component in many of these developments is a relatively new technology called OpenID - a decentralized, open authentication protocol backed by Google, Yahoo, Microsoft and numerous other Internet heavyweights. Among recent OpenID supporters is Facebook, which already operates its own widely used proprietary authentication system (but see more here on potential synergy between the two systems).

OpenID provides a way for individuals to identify themselves uniquely across the Internet with a single set of credentials held with a provider of their choice, thus avoiding the pain of managing multiple usernames and passwords across a plethora of different websites. OpenID is rapidly gaining ground in the wider online community, and as recently suggested in a publication in PLoS (ref. 1), it would be possible to use the same system for researcher identification. This proposal has much merit, though other options need to be considered (in particular organization-based Shibboleth identities) and there may even be a case for devising a completely new system specifically for biomedical researchers. Whichever system(s) come to be used, however, it is important to realize that individual sub-domains of biomedical research (e.g., journal publishing, funding organisations) will very often wish to employ their own set of individual IDs. This in no way conflicts with the principle of researchers having a universal OpenID, as this would be matched ‘behind the scenes’ to the alternative IDs used by publishers, funders, etc. (see more on this page).

More generally, whichever ID and authentication system is used, there are many reasons to make it user-centric, so that: a) the individual is able to manage his own online identity, and b) the individual has principal control over where his identifier and online profile(s) are deployed and who has access to which sections of them. At present OpenID and its companion protocols fit these requirements very well, and so for the remainder of this series we will provisionally assume that OpenID is the authentication system of choice. However, bear in mind that the usage scenarios described in the sections to follow merely depend on some common mechanism for identification, and not on the use of the OpenID protocol per se.

How does OpenID work?

In brief, the concept of OpenID is that user authentication (i.e. the user proving that he is who he claims to be) is delegated to a third party, the OpenID Provider (OP), instead of taking place at the originating website (e.g. a blogging website), which is called a Relying Party (RP). Put another way, the originating website doesn't ask the user logging in via OpenID for proof of his identity, but instead asks the provider of that OpenID "is this person who he says he is?". The user can go to any number of other websites which support OpenID and the process is repeated. In fact, if the user is already logged into the OP site in the same web browser session, he is authenticated straight away (this is called single sign-on, or SSO).
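The delegation described above can be illustrated with a toy, in-memory simulation. This is only a minimal sketch under stated assumptions and not the OpenID wire protocol: the class names OpenIDProvider and RelyingParty, their methods, and the example identifier are all hypothetical, and a real deployment would rely on an established OpenID library communicating over HTTPS.

```python
import secrets


class OpenIDProvider:
    """Toy stand-in for an OpenID Provider (OP); hypothetical and in-memory only."""

    def __init__(self):
        self._passwords = {}    # identifier -> password (plain text only for the demo)
        self._sessions = set()  # identifiers with an active browser session at the OP
        self._assertions = {}   # one-time assertion handle -> identifier

    def register(self, identifier, password):
        self._passwords[identifier] = password

    def login(self, identifier, password):
        """The user proves their identity to the OP, never to the relying party."""
        if self._passwords.get(identifier) == password:
            self._sessions.add(identifier)
            return True
        return False

    def assert_identity(self, identifier):
        """Issue a one-time assertion if the user has a session with the OP."""
        if identifier not in self._sessions:
            return None
        handle = secrets.token_hex(16)
        self._assertions[handle] = identifier
        return handle

    def verify(self, handle, identifier):
        """Answer the RP's question: 'is this person who they say they are?'"""
        return self._assertions.pop(handle, None) == identifier


class RelyingParty:
    """Toy stand-in for a website (RP) that accepts OpenID logins."""

    def __init__(self, name, provider):
        self.name = name
        self.provider = provider

    def sign_in(self, identifier):
        # The RP never sees a password; it only asks the OP to vouch for the user.
        handle = self.provider.assert_identity(identifier)
        return handle is not None and self.provider.verify(handle, identifier)


op = OpenIDProvider()
op.register("alice.example-op.org", "correct horse")
blog = RelyingParty("blog", op)
wiki = RelyingParty("wiki", op)

before_login = blog.sign_in("alice.example-op.org")    # no OP session yet
op.login("alice.example-op.org", "correct horse")      # authenticate once at the OP
after_login = (blog.sign_in("alice.example-op.org"),
               wiki.sign_in("alice.example-op.org"))   # single sign-on at both RPs
```

In this sketch the password is only ever presented to the provider, and a single session with the provider is enough to sign in at both relying parties, which mirrors the single sign-on behaviour described in the text.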

The following sections on this page discuss key aspects of OpenID relating to researcher identification. More details on OpenID can be found on the OpenID website and on http://openidexplained.com. OpenID proponent Simon Willison also has an excellent Google Tech Talks presentation on this topic.

Freedom to choose OpenID provider

An important feature of OpenID is that there is no single provider that everyone must use (as was the case several years ago with Microsoft's proprietary Passport service, now rebranded as Live ID). OpenID is essentially decentralized, with a great many different OPs to choose from. In fact, millions of Internet users already have an OpenID without knowing it, because Google, Yahoo and other services have recently started providing OpenIDs for their users, and users of these sites can therefore already log into tens of thousands of websites which support OpenID.

It is worth noting that, given the wide range of available OPs offering different levels of service (including, crucially, security), control is effectively put into the user’s hands to manage his online identity. For example, a user may not be satisfied with the traditional username/password credentials offered by, say, Google or Yahoo that he may already have. The user is then free to choose another provider offering more secure hardware-based authentication solutions (e.g. smart cards, one-time pass key via mobile phone text message), such as those offered by VeriSign, MyOpenID, Vidoop and others. Conversely, an RP handling sensitive data may wish to accept OpenIDs only from providers with more secure authentication schemes. An example of this is Microsoft’s HealthVault service, which only accepts three OpenID providers (at the time of writing).
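The idea of an RP accepting only providers with sufficiently strong authentication could be sketched as follows. This is an illustration under stated assumptions: the Assurance levels, the PROVIDER_ASSURANCE table and the provider URLs are invented for the example and do not correspond to any real provider listing or to a feature of the OpenID protocol itself.

```python
from enum import IntEnum


class Assurance(IntEnum):
    """Hypothetical ranking of authentication strength, weakest first."""
    PASSWORD = 1
    ONE_TIME_CODE = 2
    HARDWARE_TOKEN = 3


# Hypothetical table that a relying party might maintain by hand.
PROVIDER_ASSURANCE = {
    "https://op-basic.example/": Assurance.PASSWORD,
    "https://op-sms.example/": Assurance.ONE_TIME_CODE,
    "https://op-smartcard.example/": Assurance.HARDWARE_TOKEN,
}


def accepts(provider, minimum):
    """Return True if the provider is known and meets the minimum assurance level."""
    level = PROVIDER_ASSURANCE.get(provider)
    return level is not None and level >= minimum


# A site handling sensitive data might demand hardware-based authentication,
# while a blog might be content with any known provider.
sensitive_site_ok = accepts("https://op-basic.example/", Assurance.HARDWARE_TOKEN)
blog_site_ok = accepts("https://op-basic.example/", Assurance.PASSWORD)
```

Unknown providers are rejected outright, which corresponds to the allow-list behaviour described for services that accept only a handful of providers.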

One identity, multiple personal profiles

When registering for an account on a website, the user is typically asked to provide an E-mail address, nickname and personal information. On websites supporting OpenID, during the authentication request the website will (via the attribute exchange part of the protocol) often ask the OpenID provider for this information from the user's profile, and the user can approve this request if desired. Most OPs give users the option of creating multiple personal profiles, or personae (e.g. 'work', 'personal'), and then choosing the appropriate one upon registration with a new service. Then again, some people will prefer to have entirely separate OpenIDs for different purposes. There's nothing wrong with this; one can have as many OpenIDs as desired.
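The persona mechanism described above can be pictured with a small sketch. The PERSONAS table, the attribute names and the release_attributes function are hypothetical and are not taken from the attribute exchange specification; the point is only that the website receives no more than what it requested and the user approved.

```python
# Hypothetical personae held by an OpenID provider for one user.
PERSONAS = {
    "work": {
        "email": "j.smith@university.example",
        "nickname": "jsmith",
        "fullname": "J. Smith",
    },
    "personal": {
        "email": "jay@home.example",
        "nickname": "jay",
    },
}


def release_attributes(persona, requested, approved):
    """Return only the attributes that were requested, approved, and present."""
    profile = PERSONAS[persona]
    return {key: profile[key]
            for key in requested
            if key in approved and key in profile}


# A website asks for three attributes; the user picks the 'work' persona
# but declines to share the full name.
shared = release_attributes(
    "work",
    requested=["email", "nickname", "fullname"],
    approved={"email", "nickname"},
)
```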

Highlighting the different levels of OP service mentioned above, Google does not currently offer a way to manage OpenID profiles and only exposes a user's E-mail address. Users who want to make use of profiles would therefore have to sign up with a different OpenID provider, such as MyOpenID, which does offer this service.

OpenID and security

There have been numerous criticisms regarding security in OpenID. One common concern is that with a single username/password, the user now has all his "eggs in one basket", and if that one set of credentials is compromised then an unscrupulous hacker has access to all of his user accounts. This is true, but given that most people already use the same username/password for many different websites, existing methods are no better. In fact, since there is a single place where the user authenticates, this one place can be made more secure by choosing a reliable OpenID provider, whereas with multiple sites hosting his credentials the user has no control over how secure each account is. Additionally, OPs usually have some way for the user to audit the list of websites he has used his OpenID with, and the user can if needed remove an untrustworthy website from this list of trusted sites.

Another concern is that somebody might take over the Internet domain that your OpenID is based on (e.g. me.myprovider.com), and thereafter control your identity and possibly pose as you all over the Internet. Again, this is true, but the risk can be reduced by choosing a quality OpenID provider. And, as above, the existing practice of using the same E-mail address when registering on many websites has essentially the same weakness: the E-mail server domain could be hijacked, and the hijackers could then request the password-reset E-mails that websites usually offer and thereby easily gain control of your accounts on those websites.

Thirdly, because the OpenID authentication process involves redirecting the user from the original site to the OP site, it is vulnerable to "phishing", or man-in-the-middle, attacks. In this scenario, the user is redirected from some less-than-honest website (where he wants to register) to a fake OP website instead of the real provider site. The user types in his username and password, the fake website captures this information, and the hackers can then take over the user's identity. This is a serious concern, but it is by no means limited to OpenID. In any case, phishing can be countered by making sure that the web location (URL) box in the web browser actually contains the right domain name (and not, say, fakeprovider.myopenid.biz). There are also browser extensions (such as VeriSign's OpenID SeatBelt) which help to detect phishing attempts, as well as information cards and related tools.
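The manual check of the URL box amounts to comparing the host name in the address with the provider's real domain. A minimal sketch of that comparison, using Python's standard urllib.parse module, is shown below; the function name is_expected_provider and the example URLs are assumptions made for illustration, and this is not how any particular browser extension works.

```python
from urllib.parse import urlparse


def is_expected_provider(url, expected_domain):
    """Return True only for HTTPS URLs whose host is the expected domain
    or one of its subdomains."""
    parts = urlparse(url)
    host = (parts.hostname or "").lower()
    expected = expected_domain.lower()
    same_site = host == expected or host.endswith("." + expected)
    return parts.scheme == "https" and same_site


genuine = is_expected_provider("https://me.myopenid.com/login", "myopenid.com")
lookalike = is_expected_provider("https://fakeprovider.myopenid.biz/login", "myopenid.com")
prefix_trick = is_expected_provider("https://myopenid.com.evil.example/", "myopenid.com")
```

Comparing the whole host name, rather than searching for the provider's name anywhere in the URL, is what defeats look-alike addresses such as the two rejected examples.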

Conclusions

As discussed on other pages in this primer, there could be big benefits in adopting a common authentication system for researchers to identify themselves on the Internet. OpenID seems like the perfect candidate for this purpose, striking a good balance between ease-of-use and security.

Sensitive datasets, data privacy, and access control

Investigations into clinical materials, especially high-throughput experiments and genetic epidemiology studies using thousands of individuals, generate data from which study participants can be identified. In order to protect these individuals from potential misuse of the data generated about them (e.g. discrimination by health insurance providers or potential employers), the dissemination of these data must be carefully controlled and involves many stakeholders (see e.g. ref. 1). But this will become increasingly costly and difficult to manage on a case-by-case basis, given increases in: the number of such studies; the number of groups/consortia generating such datasets; the number of databases wishing to integrate and disseminate the information; and the number of researchers wishing to access these data.

Case study: individual-level data from genome-wide association studies

Currently, to gain access to genotype data from genome-wide association studies (GWAS) from the Wellcome Trust Case-Control Consortium (WTCCC), one must complete a special form, wait up to 2 months for approval from the relevant Data Access Committee, and sign a Data Access Agreement. The researcher is then allowed to download encrypted files from the European Genotype Archive (EGA) website to his computer, and must decrypt these files with a provided key. NCBI’s database of Genotypes and Phenotypes (dbGaP) has similar procedures in place.

While there are good reasons for these measures, they already impede the rate of research progress, and will increasingly do so as opportunities for broad dataset integration and meta-analysis become ever more curtailed due to limitations on access. Simply extending the current system will not change the core fact that access permissions must be applied for per dataset/project, making it very onerous for researchers who need to access many datasets from multiple sources. Also, as the researcher must download the data to his local computer, the system does not scale up to future applications where data integration will take place on-the-fly across many diverse data sources on the Internet. Therefore, even though the primary data in question are in principle available to researchers, the potential for data reuse (e.g. data mining, secondary analyses) is greatly diminished by current dissemination practices.

Case study: aggregate data from genome-wide association studies

Aggregate representations of individual genotype information (genotype/allele frequencies, aka genotype summaries) from GWAS were until recently distributed without restrictions, based on the assumption that this level of detail does not enable re-identification of individuals within the group. This enabled secondary data providers, or portals (such as HGVbaseG2P), to collect these data and present them to end users via special-purpose genome views and search modalities, thus adding value to the original results.

However, in a recent paper (ref. 2) the authors show that given a high-density genetic profile for an individual, it is possible to work out whether the individual participated in a genetic association study, even if only aggregate genotype data are available from the study. As a result of these findings, various data providers and funders have effectively halted unrestricted sharing of aggregate data (see e.g. the response from NIH, ref. 3) and now full individual-level data access privileges are required to get to the aggregate data. As a consequence, secondary data providers now cannot present or re-distribute aggregate GWAS data without greatly restricting the amount of information shown at one time, severely limiting the value such projects could otherwise add.

A registry for users of biomedical data

The whole process would obviously be greatly streamlined if one or more services (probably operated by major regional data centres such as WTSI and NCBI) were to store information on access privileges for each researcher based on an OpenID that he would provide upon registration. The registry (or registries) could then be used by various primary and secondary data providers (whether or not part of the WTCC/SI and NCBI) to check whether or not a person should be allowed access to a given type of sensitive dataset. The same registry could also be used to ‘blacklist’ individuals found guilty of inappropriate use of data (though the complex issue of sanctions needs much further consideration, whatever mechanism for access approval is in operation).

Granularity of data access permissions

The registry and participating data providers could have different levels of granularity for access permissions. For example, in the simplest scenario a person who is listed in the registry (thereby confirming his status as a researcher) could be given 'blanket' access to quasi-sensitive data (such as aggregate genotypes, as outlined in the case study above). System(s) enabling this could be developed relatively quickly, and thus this strategy may serve as an interim solution to the acute aggregate data sharing problem.

In a more complex scenario involving individual-level data, a researcher could be granted access to all datasets from a particular archive (e.g. dbGaP), or all data from a particular consortium which has submitted several datasets to one or more archives (e.g. all WTCCC data). Finally, a researcher could be given access to only a particular dataset (e.g. WTCCC bipolar study).
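The levels of granularity outlined above could be modelled roughly as follows. This is a minimal sketch under stated assumptions rather than a proposed design: the Dataset and Registry classes, the scope names ("archive", "consortium", "dataset"), the two sensitivity labels and the example identifiers are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Dataset:
    name: str
    archive: str       # e.g. "EGA" or "dbGaP"
    consortium: str    # e.g. "WTCCC"
    sensitivity: str   # "aggregate" (quasi-sensitive) or "individual"


@dataclass
class Registry:
    grants: dict = field(default_factory=dict)    # OpenID -> set of (scope, value)
    blacklist: set = field(default_factory=set)   # OpenIDs barred from all access

    def register(self, openid):
        """List a person in the registry, confirming status as a researcher."""
        self.grants.setdefault(openid, set())

    def grant(self, openid, scope, value):
        """Record a permission at archive, consortium or dataset level."""
        self.grants.setdefault(openid, set()).add((scope, value))

    def may_access(self, openid, dataset):
        if openid in self.blacklist or openid not in self.grants:
            return False
        if dataset.sensitivity == "aggregate":
            return True  # 'blanket' access for any registered researcher
        held = self.grants[openid]
        return (("archive", dataset.archive) in held
                or ("consortium", dataset.consortium) in held
                or ("dataset", dataset.name) in held)


bipolar = Dataset("WTCCC bipolar study", "EGA", "WTCCC", "individual")
summaries = Dataset("WTCCC bipolar study (genotype summaries)", "EGA", "WTCCC", "aggregate")

registry = Registry()
registry.register("alice.example-op.org")                          # registered only
registry.grant("bob.example-op.org", "consortium", "WTCCC")       # all WTCCC data
registry.grant("carol.example-op.org", "dataset", bipolar.name)   # one dataset only

alice_summaries = registry.may_access("alice.example-op.org", summaries)
alice_bipolar = registry.may_access("alice.example-op.org", bipolar)
bob_bipolar = registry.may_access("bob.example-op.org", bipolar)
```

A data provider, whether primary or secondary, would only need to call something like may_access with the OpenID of the logged-in user, which is the lookup role the text assigns to the registry.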

Conclusions

Current practices for disseminating sensitive biomedical data are onerous and will not scale to support future large-scale data integration tasks. An online registry or registries of researchers and their data access privileges will be key components in streamlining this process, both for acquiring data access permits initially and for accessing the data from primary as well as secondary data providers. Such a framework will require researchers to prove that they are who they say they are in a robust way across multiple websites, which in turn requires the adoption of a universal authentication system (as described in this section of our series).

Author names and authorship in scientific publications

Ambiguity in author names for scholarly publications has long been a problem in science. Several commercial and non-commercial efforts have been launched to counter this problem, where authors are assigned unique identifiers and ambiguities are resolved, in part by enlisting the help of authors themselves. However, the value of these services will be limited unless they can be linked with one another and other services via a common authentication system.

Name confusion: who published what?

Multiple authors can have the same name, and authors sometimes change their name (e.g. women marrying and taking their husband’s family name). This can result in inaccurate literature searches, the wrong person being asked to peer-review a paper, and a host of other problems. The problem is particularly pronounced for non-English authors from countries such as India or China where a large number of individuals share the same family name, a situation made worse when different names end up being spelt in different ways when converted into English (ref. 1).

Existing initiatives

Unique author identifiers have been suggested to resolve this problem (ref. 2), and two commercial services by major publishers, ResearcherID by Thomson-Reuters (see also comment in The Lancet, ref. 3) and Scopus Author Identifier by Elsevier, are attempts at doing just that. The non-profit CrossRef organization, which runs the DOI cross-publisher citation linking system, is also working on a system for contributor identifiers, provisionally named CrossReg (G. Bilder, personal communication). Whether run by a single organization or multiple organizations/companies, an open contributor identifier service or multiple linked services (hereafter referred to as simply CrossReg, for convenience) would be valuable on many levels in scientific publishing, just as DOIs have been for the publications themselves.

Contributor identifier service + OpenID

So how may authors benefit from such a system? Apart from the obvious advantage of author name disambiguation, one important answer concerns centralized author profile management: once a researcher has registered with CrossReg to claim his profile (and in the process supplied proof that he is who he claims to be), he could then associate his OpenID (see previous section) with his contributor ID. This would enable a host of new possibilities, such as logging on to a publisher’s website via OpenID (e.g. in order to submit a manuscript) and allowing the publisher to securely retrieve the author’s current affiliation and other profile information from the CrossReg service. See more on this page.

Incentives/rewards for scientific contributions

The traditional way to gauge a researcher’s scientific prowess is to look at his publication record in peer-reviewed journals, and use crude, imperfect metrics like the ISI Impact Factor (IF) as a measure of the quality of these journals. But there are many other ways, besides authoring traditional papers, in which researchers contribute to science. Submissions to biological databases, curation of data in those databases, and Web 2.0 activities like scientific blogging and online commenting on and rating of scientific papers (pioneered by PLoS) are examples of activities for which researchers get little or no credit at present. Here we will explore how a microattribution scheme providing incentives/rewards could help in this regard and outline some examples.

Microcredit/microattribution

If the various contributions (such as those listed above) can be tracked and linked to the identity of each researcher via his OpenID (see previous section), what would then gradually emerge is a web of publication credit-like (aka ‘microattribution’) information which can be mined and aggregated to produce far more useful metrics of individual scientific contribution than is possible today (see e.g. Scholar Factor as proposed in a recent paper in PLoS, ref. 1). These ideas are being further developed in the guise of a BioMedical Resource Impact Factor (BRIF) (ref. 2), which is heavily centered on the needs and activities of the Biobanking community.

Submissions to biological databases

Database submissions are often driven by journal and/or funder requirements; i.e. in order to get a manuscript accepted (or research funded), the authors must submit their primary data to the appropriate archives (e.g. DNA sequences to GenBank/EMBL/DDBJ) and then cite submission accession numbers in their manuscript. For some categories of data, mainly DNA sequences and gene expression microarray experiments, this arrangement is now well established and the data are relatively standardized.

But for other kinds of data which have emerged quite recently (e.g. results from genome-wide association studies, see previous section) this is not the case, and papers are published every month where the underlying data are not made available. There can be several reasons for this, but key factors are undoubtedly that i) journals/funders do not yet require the data to be submitted to archives, and ii) researchers are not rewarded for submitting data.

Construction/maintenance of databases

For many kinds of biological databases, maintenance involves manual curation of the data contents. Biocurators verify data correctness (e.g. automated gene structure predictions), enhance data by adding related information (e.g. from literature mining) or cross-reference with other databases, and so on. Such work goes largely unnoticed as the output cannot usually be measured in journal publications, prompting calls for a robust infrastructure for recognition of curation work. This will be particularly important to facilitate community curation of large-scale datasets which are too large to be tackled by curator teams (ref. 3).

Additionally, funding for construction, and particularly long-term maintenance, of biological databases is hard to secure (ref. 4). It would be valuable for database maintainers if they could unequivocally show reviewers how useful their data content is to the community, by way of accurate citation metrics for datasets (possibly right down to the level of individual database records). This would be a far more useful metric of the scientific value of a database than simple website traffic statistics and/or citations to the paper describing the database as a whole.

Conclusions

Constructing a microcredit-tracking system or systems for incentives/rewards is entirely feasible technically, but success is heavily dependent on researchers being able to identify themselves uniquely (and securely) to the various online services involved. Furthermore, a key aspect of this is that the user should always have the choice whether or not to use his primary public identity for these activities (i.e. in situations where anonymity is preferred). The last section in this primer will discuss some scenarios where a user-centric system (as previously presented) can underpin a loosely-connected network of services which can go a long way towards achieving this goal.

Tying it all together

In previous sections of this primer we introduced several aspects of the identification problem and outlined scenarios where a universal authentication system forms an integral part of the solution (see Figure 1 below). Now it is time to investigate how this may work in practice, which technologies (in addition to OpenID) might play a role, and how a researcher might leverage these tools to aggregate information about himself in a meaningful way.

[Figure 1: Tying it all together]

Case study: data access control 

Imagine a researcher working for company X who has applied (and been approved) for access to a collection of type 1 diabetes GWAS datasets in several online archives. The company does not want competitors to know that they are working on this particular disease (NB the same would be true for many academic researchers as well). In general, we can conclude that information on a researcher's data access permits is private information (akin to one's address book) that most users will want to keep to themselves, and that it should not be shared without the user's approval.

Relating this to the access mechanism proposed in a previous section, we can identify a key set of requirements: A) the user needs to log onto one website (the data provider), and B) this service needs to be able to securely request information from another site (the researcher / data permit service) on the user's behalf (see Figure 1b). The good news is that these requirements can be fulfilled with existing technologies.

Authentication vs authorization

We have already introduced OpenID as a candidate for solving the authentication part of this equation (proving who you are). Put another way, OpenID works sort of like a set of master keys that will open the doors to your house, your car etc. You would never give these keys to just any person on the street and trust them to not do anything inappropriate (like going to your house and stealing your television!). For the same reason, you should never give out your OpenID credentials to anyone.

The other part of the equation is authorization: in the scenario above we need some way for the authenticated user to explicitly permit the data provider to use his OpenID credentials to connect to the researcher registry and retrieve a particular piece of the user's private information. It is possible to combine OpenID (or some other authentication protocol) with its companion open network protocol, OAuth, in order to control permissions for web services in a fine-grained manner. OAuth is often likened to a special valet key for luxury sports cars, with which the parking attendant can only drive the car a short distance around the parking lot. Another analogy is giving somebody the keys to your house, with the restriction that this somebody can only enter on a Saturday afternoon and only to watch the football game on your television (perhaps better explained here).
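The "valet key" idea can be sketched as a token that is tied to one client and limited to named resources. The following is a toy illustration only and does not implement the OAuth protocol (there are no signatures, redirects or expiry times); the ResearcherRegistry class, its methods, the resource names and the client name data-provider.example are all hypothetical.

```python
import secrets


class ResearcherRegistry:
    """Toy registry holding private per-researcher information."""

    def __init__(self):
        self._private = {}  # OpenID -> {resource name: value}
        self._tokens = {}   # token -> (OpenID, client, allowed resources)

    def store(self, openid, resource, value):
        self._private.setdefault(openid, {})[resource] = value

    def authorize(self, openid, client, scope):
        """Called once the authenticated user approves a client's request."""
        token = secrets.token_urlsafe(16)
        self._tokens[token] = (openid, client, frozenset(scope))
        return token

    def revoke(self, token):
        """The user can withdraw the delegated permission at any time."""
        self._tokens.pop(token, None)

    def fetch(self, token, client, resource):
        grant = self._tokens.get(token)
        if grant is None:
            raise PermissionError("unknown or revoked token")
        openid, granted_client, scope = grant
        if client != granted_client or resource not in scope:
            raise PermissionError("token does not cover this request")
        return self._private[openid][resource]


registry = ResearcherRegistry()
registry.store("alice.example-op.org", "data_permits", ["WTCCC bipolar study"])
registry.store("alice.example-op.org", "email", "alice@university.example")

# The user lets the data provider read the permits, and nothing else.
token = registry.authorize("alice.example-op.org", "data-provider.example",
                           scope={"data_permits"})
permits = registry.fetch(token, "data-provider.example", "data_permits")
```

The data provider never handles the user's OpenID credentials; it holds only a token that works for one client and one resource, and that the user can revoke.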

If you are interested in seeing how this works hands-on, see this site for a simple demo which retrieves the contact list from your Google account. A real-life example (albeit using proprietary technology, rather than OpenID+OAuth) is provided by the Facebook social networking website: if you are a Facebook user, chances are you are already using authentication/authorization technologies to share your private data with certain Facebook applications, or to connect your Facebook account with external services, such as the Flickr photo-sharing site.

Case study: microcredit tracking

When it comes to tracking of database submissions, curation and other scientific contributions, one can imagine a researcher initially associating his OpenID with a microcredit tracker service. Whenever the researcher submits data to a biological repository (logged in via his OpenID), the repository submission service contacts the tracker service (securely, via OAuth) and transmits an indicator of the contribution (an authenticated token-based mechanism has been suggested; ref. 1). The same could be done for data curation efforts, wiki editing and many other kinds of contributions; the tracker mechanism could be made completely generic.

Over time, the tracker (or possibly multiple trackers) would therefore aggregate submission credit information for the researcher, and the researcher may choose to make this aggregated information public (though some may not want to).
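A generic tracker of this kind could look roughly like the sketch below. It is an illustration under stated assumptions: the record fields, the MicrocreditTracker class and the use of a shared-secret HMAC signature to authenticate the repository are choices made for the example (standing in for the OAuth-secured call mentioned above), not the mechanism proposed in the cited work.

```python
import hashlib
import hmac
import json
from collections import Counter


def sign(record, secret):
    """Sign a contribution record with a secret shared by repository and tracker."""
    payload = json.dumps(record, sort_keys=True).encode("utf-8")
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


class MicrocreditTracker:
    def __init__(self):
        self._secrets = {}    # repository name -> shared secret (bytes)
        self._credits = {}    # OpenID -> Counter of contribution kinds
        self._public = set()  # OpenIDs whose summaries may be shown

    def enrol_repository(self, name, secret):
        self._secrets[name] = secret

    def record(self, record, signature):
        """Accept a contribution only if it was signed by a known repository."""
        secret = self._secrets.get(record["repository"])
        if secret is None or not hmac.compare_digest(sign(record, secret), signature):
            return False
        self._credits.setdefault(record["openid"], Counter())[record["kind"]] += 1
        return True

    def set_public(self, openid, public):
        """The researcher decides whether the aggregated credit is visible."""
        if public:
            self._public.add(openid)
        else:
            self._public.discard(openid)

    def summary(self, openid):
        if openid not in self._public:
            return None
        return dict(self._credits.get(openid, {}))


secret = b"demo-secret-shared-out-of-band"
tracker = MicrocreditTracker()
tracker.enrol_repository("sequence-archive.example", secret)

submission = {"repository": "sequence-archive.example",
              "openid": "alice.example-op.org",
              "kind": "submission",
              "accession": "DEMO0001"}
accepted = tracker.record(submission, sign(submission, secret))
hidden = tracker.summary("alice.example-op.org")   # private by default
tracker.set_public("alice.example-op.org", True)
visible = tracker.summary("alice.example-op.org")
```

Because the record is just a dictionary with a "kind" field, the same call would serve for curation, wiki editing or other contribution types, and the summary stays private until the researcher chooses otherwise.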

Aggregating information to populate a professional profile

Suppose that a future researcher has, through his online activities, accumulated various kinds of information at many different locations, all connected via his online identity. How can he aggregate this information and put it to use? One realistic objective is to create a professional profile (e.g. for job applications) which, among other things, would list scholarly publications and other contributions. It is important to many professionals to maintain such a profile online, often via dedicated websites such as LinkedIn or on Facebook and similar social networking websites.

One can easily imagine an extension to LinkedIn which lets a user configure his profile to include a list of published papers retrieved from, and verified by, a central CrossReg service (see previous section), as well as a summary of verified database submissions fetched from a microcredit tracker service.

Conclusions

Any system which enables detailed tracking of individuals’ activities, whether online or in the real world, brings with it the potential for invasion of privacy by governmental agencies and other parties. These ‘Big Brother’ concerns are valid and need to be addressed. But researchers cannot expect to have their cake (anonymity) and eat it too (accurate publication record, microattribution etc.). As pointed out in a recent report (ref. 2), there is “a careful balance to be struck between giving credit where credit is due and knowing everything about everyone”.

Nevertheless, a system such as outlined above, where the individual is in the driving seat and controls his online identity and how/where it is used, would go a long way towards addressing these privacy concerns and will be an important aspect of how science is conducted in the future.