Atom web feeds, the AtomPub protocol and G2P databases
| Contributed by: | Gudmundur A Thorisson |
| Originally posted: | 5th October 2009: 10:31 am |
| Last updated: | 10th February 2010: 12:40 pm |
| Short URL: | http://gen2phen.org/node/7142 |
This post, something of a sequel to my previous post on REST vs SOAP, discusses how web feeds (aka RSS or 'syndicated' feeds) and related Web technologies can potentially be used for a variety of G2P databasing tasks, both to enhance the overall Web 2.0 user experience and for 'lightweight' federated database querying and data exchange.
What is a web feed?
Web feeds are commonly used on websites where content is frequently updated, such as blogs or news websites. A feed is a simple XML web document describing a list or collection of items, typically recent blog entries or news headlines. Web users can subscribe to new content from sites of interest by adding the web location, or URL, of site feed documents to a so-called feed reader (e.g. Google Reader or a similar Web tool, or a standalone application like Firefox Live Bookmarks). The feed reader routinely checks or 'polls' the subscribed feeds for updated items, and so enables the user to easily monitor a large number of websites for new content without having to visit each site regularly.
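The polling step can be sketched in a few lines of Python. This is only an illustration, assuming the Atom feed format introduced later in this post; the function name `new_entries` and the sample feed are made up, and a real reader would fetch the document over HTTP and persist the set of seen ids between runs.

```python
import xml.etree.ElementTree as ET

# Namespace of the Atom syndication format, in ElementTree's {uri}tag notation.
ATOM = "{http://www.w3.org/2005/Atom}"

def new_entries(feed_xml, seen_ids):
    """Return (id, title) pairs for entries not seen before, and record their ids."""
    root = ET.fromstring(feed_xml)
    fresh = []
    for entry in root.findall(ATOM + "entry"):
        entry_id = entry.findtext(ATOM + "id")
        if entry_id not in seen_ids:
            seen_ids.add(entry_id)
            fresh.append((entry_id, entry.findtext(ATOM + "title")))
    return fresh

# Hypothetical feed document standing in for a real HTTP response.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example feed</title>
  <entry><id>urn:example:1</id><title>First item</title></entry>
  <entry><id>urn:example:2</id><title>Second item</title></entry>
</feed>"""

seen = set()
first_poll = new_entries(SAMPLE_FEED, seen)   # both entries are new
second_poll = new_entries(SAMPLE_FEED, seen)  # nothing has changed since the first poll
```

Keying on the entry id rather than the title means that an entry whose title is edited later is not reported as new.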
Web feeds have become widely used in the last few years and are a key feature of many Web 2.0 applications. A large portion of the Internet user community now uses feeds daily, not least because feeds are conceptually simple for users to grasp and straightforward to use, yet can be immensely useful. For scientists, feeds can be particularly valuable as a means of keeping up to date with the scholarly literature; for instance, PubMed lets you save a search as an RSS feed URL, and the HubMed service does the same; here's one of my subscribed feeds.
Beyond the obvious advantage to Web users as a tool to keep up to date with website content, the other major use case for web feeds is content aggregation, in which a website collects updated content or content summaries from many other sites and re-publishes it. As an example of this, our very own GEN2PHEN Knowledge Centre website aggregates feeds from a number of journal websites, from which the entries our editor finds most relevant to the G2P domain are selected for re-publishing in the News section of the website.
The anatomy of a feed
A feed XML document comprises a channel, which represents the information source the user has subscribed to, and a list of entries in the channel (news item, blog post, photo etc.). Each entry typically has a title, a description or summary field and a hyperlink back to the originating site, as well as several other pieces of metadata (author, date updated etc.). There are two main XML dialects for representing this conceptual model as web documents, and most feed readers support both. RSS (Really Simple Syndication) is the older and syntactically simpler of the two and somewhat more widely supported. But for various reasons (including personality clashes, apparently, see more here), Atom has recently emerged as a more sophisticated and extensible syndication format. The figure depicts the Atom conceptual model; several XML examples are provided below.
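To make the channel/entry model concrete, here is a minimal Python sketch that reads an Atom document into a plain dictionary. The sample feed, its URLs and the function name `describe_feed` are invented for illustration; only the element names (`feed`, `entry`, `title`, `summary`, `link`, `updated`) come from the Atom format itself.

```python
import xml.etree.ElementTree as ET

NS = {"atom": "http://www.w3.org/2005/Atom"}

# Hypothetical feed with a single entry.
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Example variant feed</title>
  <id>urn:example:feed</id>
  <updated>2009-10-05T10:31:00+01:00</updated>
  <entry>
    <title>Example entry</title>
    <id>urn:example:entry-1</id>
    <updated>2009-10-05T10:31:00+01:00</updated>
    <author><name>Example Author</name></author>
    <summary>A short description of the item.</summary>
    <link rel="alternate" type="text/html" href="https://example.org/items/1"/>
  </entry>
</feed>"""

def describe_feed(feed_xml):
    """Return the channel title plus title, summary, link and date for each entry."""
    root = ET.fromstring(feed_xml)
    channel = {"title": root.findtext("atom:title", namespaces=NS), "entries": []}
    for entry in root.findall("atom:entry", NS):
        # The 'alternate' link is the hyperlink back to the originating site.
        link = entry.find("atom:link[@rel='alternate']", NS)
        channel["entries"].append({
            "title": entry.findtext("atom:title", namespaces=NS),
            "summary": entry.findtext("atom:summary", namespaces=NS),
            "link": link.get("href") if link is not None else None,
            "updated": entry.findtext("atom:updated", namespaces=NS),
        })
    return channel

channel = describe_feed(SAMPLE_FEED)
```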

For the purpose of this post, the main distinction between RSS and Atom is that RSS is very restrictive when it comes to extending the format; the main limitation is that each feed entry can only contain plain text (or escaped HTML). This severely limits what can be put in an RSS feed entry. Atom, on the other hand, is fully extensible and allows embedding arbitrary data as entry 'payload', including but not limited to structured XML from imported XML schemas (as further discussed below), or a URI reference to content located elsewhere (see figure 1). For this reason, the rest of my post will refer to Atom XML as the feed transport format unless otherwise stated.
Web feeds for G2P databases
The potential usefulness of web feeds goes far beyond just blogs and news websites. Any web content that can usefully be organised as a list of items (e.g. variants, genes, search results) on a web page can alternatively be represented as a web feed, enabling website users to subscribe to the feed just like they would any other feed to monitor for new content. A straightforward example of the use of feeds for this purpose is the listing of new entries in the NHGRI GWAS catalog, albeit as an RSS feed:
http://feeds.feedburner.com/NhgriGWASCatalogAdditions
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
<channel>
<title>NHGRI GWAS Catalog Updates</title>
<link>http://www.genome.gov/gwastudies</link>
<description>A Catalog of Published Genome-Wide Association Studies</description>
<language>en-us</language>
<lastBuildDate>Fri, 2 Oct 2009 11:22:06 AM EST</lastBuildDate>
<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="self" href="http://feeds.feedburner.com/NhgriGwasCatalogAdditions" type="application/rss+xml" />
<atom10:link xmlns:atom10="http://www.w3.org/2005/Atom" rel="hub" href="http://pubsubhubbub.appspot.com" />
<item>
<title>Study Added: Cognitive performance</title>
<link>http://www.genome.gov/gwastudies/#5580</link>
<guid>http://www.genome.gov/gwastudies/#5580</guid>
<description>
AUTHOR: Need
STUDY DATE: September 4, 2009
TITLE: A Genome-wide Study of Common SNPs and CNVs in Cognitive Performance in the CANTAB battery
</description>
<pubDate>Fri, 2 Oct 2009 11:18:24 AM EST</pubDate></item>
</channel>
</rss>
Amongst GEN2PHEN partners, this feed from the test LOVD LSDB-in-a-box instance for muscular dystrophy was recently created by the Leiden group and lists new and updated variants as the database content changes over time:
http://www.dmd.nl/nmdb2/api/feed.php
A more advanced example is representing a list of search results as a feed: a search for the keyword 'hexokinase' in the UniProt protein knowledgebase can be alternatively displayed as an RSS feed via this URL:
http://www.uniprot.org/uniprot/?query=hexokinase&format=rss
Returning to LOVD, Ivo very recently implemented Atom feed representations of several views of LOVD database content, as described in his post, such as this URL for listing all variants in a particular gene:
https://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA
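As a rough idea of what producing such a feed involves on the database side, the following Python sketch turns a list of variant records into an Atom document. It is not how LOVD does it; the record layout, the `example.org` identifiers and URLs, and the function name `variants_to_feed` are all assumptions made for the example.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"
# Serialise Atom elements with a default namespace rather than an ns0: prefix.
ET.register_namespace("", ATOM_NS)

def atom(tag):
    """Qualify a tag name with the Atom namespace."""
    return "{%s}%s" % (ATOM_NS, tag)

def variants_to_feed(gene, variants, feed_id, updated):
    """Build an Atom feed listing the given variant records for one gene."""
    feed = ET.Element(atom("feed"))
    ET.SubElement(feed, atom("title")).text = "Variants in the %s gene" % gene
    ET.SubElement(feed, atom("id")).text = feed_id
    ET.SubElement(feed, atom("updated")).text = updated
    for v in variants:
        entry = ET.SubElement(feed, atom("entry"))
        ET.SubElement(entry, atom("title")).text = "%s:%s" % (gene, v["dna_change"])
        ET.SubElement(entry, atom("id")).text = v["id"]
        ET.SubElement(entry, atom("updated")).text = v["updated"]
        # Link back to the full record in the originating database.
        ET.SubElement(entry, atom("link"), rel="alternate", href=v["url"])
    return ET.tostring(feed, encoding="unicode")

# Hypothetical database rows; ids, dates and URLs are placeholders.
records = [{
    "id": "tag:example.org,2009:variants/CRYAA_00001",
    "dna_change": "c.27G>T",
    "updated": "2009-10-05T10:31:00+01:00",
    "url": "https://example.org/variants/CRYAA_00001",
}]
feed_xml = variants_to_feed("CRYAA", records,
                            "tag:example.org,2009:variants/CRYAA",
                            "2009-10-05T10:31:00+01:00")
```

Each feed and entry carries the `id`, `title` and `updated` elements that the Atom format requires, so the output can be consumed by an ordinary feed reader.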
Feed aggregation and distributed search
As mentioned above, web feeds lend themselves to aggregation and re-publishing. The Knowledge Centre news feed scenario above, for instance, can easily be extended to database feeds. The KC website can display a list of new GWAS publications as they are included in the NHGRI catalog, right next to a master feed showing new/updated LSDB entries across many LSDBs. Again, the emphasis is not on duplicating content from the original database, but rather on announcing the presence of database entries widely, ultimately leading to increased traffic back to the original website where the full details can be retrieved.
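A master feed of this kind boils down to merging entries from several sources and ordering them by date. The Python sketch below shows the idea with two made-up Atom feeds; the feed titles, entries and the function name `aggregate` are illustrative, and the timestamps use an explicit UTC offset so that `datetime.fromisoformat` can parse them.

```python
from datetime import datetime
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

def aggregate(feeds):
    """Merge the entries of several Atom documents, newest first."""
    merged = []
    for feed_xml in feeds:
        root = ET.fromstring(feed_xml)
        source = root.findtext(ATOM + "title")
        for entry in root.findall(ATOM + "entry"):
            merged.append({
                "source": source,  # remember which database the entry came from
                "title": entry.findtext(ATOM + "title"),
                "updated": datetime.fromisoformat(entry.findtext(ATOM + "updated")),
            })
    merged.sort(key=lambda e: e["updated"], reverse=True)
    return merged

# Two hypothetical LSDB feeds.
LSDB_A = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LSDB A</title>
  <entry><title>Variant A1</title><updated>2009-10-01T09:00:00+00:00</updated></entry>
  <entry><title>Variant A2</title><updated>2009-10-03T09:00:00+00:00</updated></entry>
</feed>"""
LSDB_B = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>LSDB B</title>
  <entry><title>Variant B1</title><updated>2009-10-02T09:00:00+00:00</updated></entry>
</feed>"""

master = aggregate([LSDB_A, LSDB_B])
```

Only metadata is merged here; each aggregated item would still link back to the originating database for the full details.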
Stepping up the level of complexity, let us now consider distributed search. In the example above, UniProt usefully expose their search functionality as a standard OpenSearch document. The OpenSearch standard furthermore specifies how search results can be represented as Atom or RSS feeds with several OpenSearch-specific extensions. UniProt uses these extensions in their RSS feed above to indicate e.g. how many total results were returned from the search, and to support paging through the results.
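The OpenSearch response elements are ordinary namespaced XML elements, so a client can read them with any XML parser. The sketch below uses a made-up RSS document rather than UniProt's actual output; the numbers are invented, and the `has_more` calculation assumes the OpenSearch default of a 1-based `startIndex`.

```python
import xml.etree.ElementTree as ET

# Namespace of the OpenSearch 1.1 response elements.
OS = "{http://a9.com/-/spec/opensearch/1.1/}"

# Hypothetical search result feed; the counts are illustrative only.
SAMPLE_RSS = """<rss version="2.0" xmlns:opensearch="http://a9.com/-/spec/opensearch/1.1/">
  <channel>
    <title>Search results for hexokinase</title>
    <opensearch:totalResults>120</opensearch:totalResults>
    <opensearch:startIndex>1</opensearch:startIndex>
    <opensearch:itemsPerPage>25</opensearch:itemsPerPage>
    <item><title>Example hit</title></item>
  </channel>
</rss>"""

def paging_info(rss_xml):
    """Read the OpenSearch paging elements from an RSS channel."""
    channel = ET.fromstring(rss_xml).find("channel")
    info = {name: int(channel.findtext(OS + name))
            for name in ("totalResults", "startIndex", "itemsPerPage")}
    # Index of the last result on this page, assuming a 1-based startIndex.
    last_on_page = info["startIndex"] + info["itemsPerPage"] - 1
    info["has_more"] = last_on_page < info["totalResults"]
    return info

paging = paging_info(SAMPLE_RSS)
```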
A key feature of OpenSearch is that it enables a single web portal (the search aggregator) to query, in a relatively simple way, across many OpenSearch providers and present the unified search results to the user. This functionality is in fact built into the Drupal CMS used to build the Knowledge Centre, and so our group can with little effort 'turn on' this search functionality to enable KC users to search across all LSDBs, or all LSDBs focusing on, say, skeletal disorders, or several GWAS databases, and more.
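Under the hood, an aggregator works from the URL template that each provider publishes in its OpenSearch description document, substituting the user's query for the `{searchTerms}` placeholder. The following Python sketch shows that substitution step only; the provider names and URLs are hypothetical, and fetching and merging the resulting feeds is left out.

```python
import re
from urllib.parse import quote

def fill_template(template, terms, **params):
    """Substitute OpenSearch template parameters such as {searchTerms}.

    Optional parameters (written {name?}) with no value are replaced by an
    empty string; a missing required parameter raises KeyError.
    """
    values = dict(params, searchTerms=terms)

    def substitute(match):
        name, optional = match.group(1), match.group(2)
        if name in values:
            return quote(str(values[name]), safe="")
        if optional:
            return ""
        raise KeyError(name)

    return re.sub(r"\{([A-Za-z:]+)(\??)\}", substitute, template)

# Hypothetical providers; real templates come from each site's description document.
PROVIDERS = {
    "LSDB one": "https://lsdb-one.example.org/search?q={searchTerms}&format=atom",
    "LSDB two": "https://lsdb-two.example.org/find?term={searchTerms}&page={startPage?}",
}

query_urls = {name: fill_template(template, "skeletal dysplasia")
              for name, template in PROVIDERS.items()}
```

The aggregator would then request each URL and merge the returned feeds into one result list.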
Beyond plain-vanilla metadata feeds: enhanced Atom 'data feeds'
The scenarios I have listed so far all involve using Atom XML to carry metadata describing database entries, with an underlying assumption that feed consumers (human, or software agent) will follow the link back to the website of origin to retrieve the actual data. This by itself can be very useful for a wide range of tasks, as the examples above show. But information provided in a feed entry need not be limited to metadata. If the Atom feed format is used (rather than RSS), the Atom XML can be extended to carry arbitrary XML data as 'payload' inside a standard Atom entry 'envelope'. Atom XML can thus serve as a blueprint for machine-readable 'data feeds' suitable for channeling data from one database into another, or aggregating data from many databases.
For instance, the LSDB example above could be extended so that each feed entry contained not just variant metadata & summary, but a fully-fledged, structured XML representation. This Atom XML example combines Ivo's LOVD metadata feed with the example variant XML from Glen Dobson's recent post:
<?xml version="1.0" encoding="ISO-8859-1"?>
<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Listing of all public variants in the CRYAA gene database</title>
  <link rel="alternate" type="text/html" href="https://chromium.liacs.nl/LOVDv.2.0-dev/"/>
  <link rel="self" type="application/atom+xml" href="https://chromium.liacs.nl/LOVDv.2.0-dev/api/rest.php/variants/CRYAA"/>
  <updated>2007-06-21T17:23:00+02:00</updated>
  <id>tag:chromium.liacs.nl,2006-11-21:Chr:LOVDv.2.0-dev/REST_api</id>
  <generator uri="http://www.LOVD.nl/" version="2.0-21d">Leiden Open Variation Database</generator>
  <rights>Copyright (c), the curators of this database</rights>
  <entry>
    <title>CRYAA:c.27G>T</title>
    <link rel="alternate" type="text/html" href="https://chromium.liacs.nl/LOVDv.2.0-dev/variants.php?select_db=CRYAA&amp;action=search_unique&amp;search_Variant%2FDBID=CRYAA_00001"/>
    [....]
    <published>1970-01-01T00:00:00+01:00</published>
    <updated>1970-01-01T00:00:00+01:00</updated>
    <content type="application/xhtml+xml">
      <variant>
        <id>UBE3A_00001</id>
        <exon>08</exon>
        <dna_change>c.3_16del14</dna_change>
        <protein_change>Frame shift (predicted)</protein_change>
        <patient>
          <pathogenicity>
            <reported>Probably pathogenic</reported>
            <concluded>Probably pathogenic</concluded>
            <short>+?/+?</short>
          </pathogenicity>
          <id>003199(MC)</id>
          <disease>Angelman syndrome</disease>
          …
        </patient>
      </variant>
    </content>
  </entry>
  [.. many more entries]
</feed>
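On the consuming side, a data feed client looks inside the `content` element instead of following the link. The Python sketch below pulls a variant payload out of a single Atom entry. To keep the payload distinguishable from the Atom envelope it places the variant XML in its own namespace; that namespace URI, the entry id and title, and the `application/xml` content type are assumptions for this example and are not taken from the LOVD or Leicester formats.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "v": "http://example.org/ns/variant",  # hypothetical payload namespace
}

# Hypothetical entry: a standard Atom envelope around a custom XML payload.
ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>UBE3A:c.3_16del14</title>
  <id>urn:example:UBE3A_00001</id>
  <updated>2009-10-05T10:31:00+01:00</updated>
  <content type="application/xml">
    <variant xmlns="http://example.org/ns/variant">
      <id>UBE3A_00001</id>
      <dna_change>c.3_16del14</dna_change>
      <patient><disease>Angelman syndrome</disease></patient>
    </variant>
  </content>
</entry>"""

def extract_variant(entry_xml):
    """Return selected fields of the variant payload carried in an Atom entry."""
    entry = ET.fromstring(entry_xml)
    variant = entry.find("atom:content/v:variant", NS)
    return {
        "id": variant.findtext("v:id", namespaces=NS),
        "dna_change": variant.findtext("v:dna_change", namespaces=NS),
        "disease": variant.findtext("v:patient/v:disease", namespaces=NS),
    }

variant = extract_variant(ENTRY)
```

A generic Atom tool can ignore everything under `content` and still process the entry, while a G2P-aware client reads the payload directly.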
Atom-powered content - the Atom Publishing Protocol
As I have described above, the potential usefulness of publishing lists of database entries and entry metadata as a simple web feed is substantial, and we are already leveraging this in a number of G2P applications. But the Atom syndication format is one-half of a pair of standards, the other of which is the Atom Publishing Protocol, also known as APP or AtomPub. AtomPub is a RESTful web service protocol for publishing and editing Web resources. AtomPub is a generic framework for managing all sorts of information on the Web and is agnostic with respect to the type of content carried, and the Atom XML format plays a key role here as a content-neutral data wrapper. In the broader Web 2.0 online community, AtomPub is being adopted widely as a generic data publishing framework, notably as the foundation for major online services such as the Google Data APIs and even Microsoft's Live Platform.
The elegance of applying Atom XML/AtomPub to problems in the G2P databasing domain is that Atom feeds and entries can be created, edited, consumed, processed, filtered and redistributed by a plethora of available AtomPub software tools and frameworks. These generic tools need not know anything about the custom G2P data payloads within the feed. Alternatively, such tools and frameworks can serve as a foundation for more specialised G2P-specific tools. An example of a G2P database project attempting to do just this is the Café Rouge mutation exchange depot being developed in our group in Leicester with collaborators in GEN2PHEN. Cafe Rouge is implemented as an Atom store using the open-source AtomServer platform. Analysis software used in diagnostic labs will enable operators to transmit mutation reports to the Cafe using a standard HTTP request against the standard AtomPub API, with mutation data 'wrapped' into an Atom entry envelope as described above. Mutation reports will be advertised as Atom feeds which will subsequently be republished via 3rd party websites (such as the Knowledge Centre), enabling LSDB curators and other parties to discover and retrieve the information.
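In AtomPub terms, submitting a report means sending an HTTP POST with an Atom entry as the body to a collection URI; the server is expected to answer `201 Created` with the new entry's location. The Python sketch below only constructs such a request and does not send it. The collection URL and the entry contents are placeholders, not the actual Cafe Rouge API.

```python
import urllib.request

# Hypothetical AtomPub collection URI.
COLLECTION_URL = "https://cafe.example.org/mutations"

# Minimal Atom entry; a real report would carry the mutation XML inside <content>.
ENTRY_XML = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example mutation report</title>
  <id>urn:example:report-1</id>
  <updated>2009-10-05T10:31:00+01:00</updated>
  <author><name>Example Lab</name></author>
  <content type="text">Payload goes here</content>
</entry>"""

def build_post(collection_url, entry_xml):
    """Prepare (but do not send) the POST that creates a new collection member."""
    return urllib.request.Request(
        collection_url,
        data=entry_xml.encode("utf-8"),
        headers={"Content-Type": "application/atom+xml;type=entry"},
        method="POST",
    )

request = build_post(COLLECTION_URL, ENTRY_XML)
# Sending it would be: urllib.request.urlopen(request)
```

Because this is plain HTTP, any language or tool capable of issuing a POST request could play the same role in a diagnostic lab's analysis software.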

Conclusions
I have described the concept of web feeds and their utility for representing lists of items in a simple, standard way. The extensible Atom feed syndication format has great potential in the context of aggregations of entries contained within G2P databases. The examples provided demonstrate how metadata feeds are currently being used by GEN2PHEN partners to enhance LSDB-in-a-box software. More applications that leverage Atom feeds are forthcoming, such as enhancements to the HGVBaseG2P website which will enable users to access search results and other content as feeds.
Looking further ahead, Atom feeds and the AtomPub protocol offer tantalising potential as a generic framework for publishing data on the Web and for 'funnelling' data from one place to another. Developers can leverage well-defined data formats, semantics and protocols for such generic 'data plumbing' tasks, and instead focus their energy on domain-specific data modelling, semantics and processing tasks. The Cafe Rouge pilot project will demonstrate the power of this approach for mutation data exchange and may pave the way for many more Atom-powered G2P applications in the future.
- Visit Atom Powered Content, blog post by John Newton

Comments
#1
Hi
Here is an example of how the variation info can be coded using LSDB-XML (http://www.gen2phen.org/wiki/lsdb-xml-data-format). The XML format is patient-centric, i.e. variant info is embedded in the patient element, which must always exist. There are scenarios where this may not be OK. Is this the case with the Cafe Rouge use cases? We could tune the format accordingly if needed.
<?xml version="1.0" encoding="UTF-8"?>
<patient
id="003199(MC)"
xmlns="http://gen2phen.org/lsdb/1.1"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://gen2phen.org/lsdb/1.1 file:/Users/muilu/Documents/Dev/SVN_repos/gen2phen/trunk/data_formats/xml/lsdb.xsd" >
<!-- gender not known... is this OK? -->
<gender code="0"></gender>
<phenotype>Angelman syndrome</phenotype>
<variant id="UBE3A_00001">
<!-- is sequence type (on which the name is based on) ok -->
<name type="DNA" scheme="HGVS">c.3_16del14</name>
<exon>08</exon>
<pathogenicity>Probably pathogenic <evidence_code>reported</evidence_code></pathogenicity>
<pathogenicity>Probably pathogenic <evidence_code>concluded</evidence_code></pathogenicity>
<!-- short +?/+? ... what is this tag ? -->
</variant>
</patient>