Serving Life-science Information for the Next Generation
EUROPEAN MOLECULAR BIOLOGY LABORATORY
Higher or Secondary Education Establishments
€ 5 210 522,90
Phil Irving (Dr.)
SIB INSTITUT SUISSE DE BIOINFORMATIQUE
€ 1 774 655,10
TECHNISCHE UNIVERSITAET BRAUNSCHWEIG
€ 232 554
EUROPEAN PATENT ORGANISATION
€ 1 500 133
€ 82 104
Grant agreement ID: 226073
1 March 2009
31 August 2012
€ 10 834 375,20
€ 8 799 969
EUROPEAN MOLECULAR BIOLOGY LABORATORY
Optimised use of biomolecular information
Final Report Summary - SLING (Serving Life-science Information for the Next Generation)
The SLING project has supported the European exploitation of biomolecular information in three ways:
1) Providing high-quality electronic services allowing access to comprehensive, accurate and up-to-date information
SLING funding has contributed to the EBI effort in collecting, organizing and making available a wide range of biomolecular data, and providing services allowing scientists to exploit those data (including genes and genomes, protein sequences and structures, molecular behaviour and interactions). During SLING, EBI web hits have more than doubled; the monthly unique users have increased by over a half; individual EBI resources have generally seen large increases in access and numbers of users.
SLING funding has contributed to the SIB protein annotation enhancements for UniProtKB/Swiss-Prot (a high-quality curated resource of protein sequences and functional annotation). During SLING, 124,000 curated sequence records were added, and 7,800 corrections were made, 85,000 Gene Ontology manual annotations were added, 724 new complete EC numbers were created, and 100,000 individual PTMs were curated. SLING funded ongoing curation of structural information from publications - 5,000 references, 40,000 cross-references, in 4,000 UniProtKB/Swiss-Prot records.
2) A programme of joint research activities which will enhance these services and develop them in response to changing science
Before SLING there was little or no exploitation of Patent Literature in basic science. SLING has now contributed to significant advances in the availability of full-text resources that support computational approaches to literature/data integration. SLING work has established 1) a full-text article repository (An Open Access article FTP site, plus release of a public web service) and 2) a full-text patent repository (containing 1.3 million full-text patents, plus a basic web service). In addition, the SLING work has informed possible future developments of full-text resources; in the case of journal articles, the value of re-using full-text has been demonstrated by the development of a text-mining based tool that supports curation work.
Before SLING, databases were poorly-adapted to include next-generation sequence data storage and interpretation. After SLING, those data have been collected, integrated and organised through such activities as the development of 1) an epigenomics checklist of minimal submitter information, 2) a meta- and full-data exchange pipeline between NCBI and EBI, 3) submission tools with automated receipt and status reporting, and re-launch of loading jobs, 4) metadata text-search and link to API that allows users to provide text that may appear in the metadata layer and return all of those records where the text is present, 5) an Ensembl ‘Regulatory Build’ analysis pipeline, and 6) a portal between Ensembl and the sequence archives.
Before SLING, there was patchy information attached to genomes. SLING has now provided the infrastructure, methods, and curatorial expertise to acquire, annotate, integrate and disseminate Next Generation Sequencing based transcriptomics (NGST) data across the EBI, between SLING partners, in Europe and internationally. To achieve these aims, SLING funding has contributed to the development of 1) a community-generated and adopted format (MINSEQE); 2) a data-exchange agreement between the world's major repositories (USA NCBI Gene Expression Omnibus and the European ArrayExpress); 3) Open Source software supporting the data exchange agreement with conversion scripts, and curator support tools; 4) an Open Source submission tool for submitting NGST data; 5) training courses and resources; 6) an enhanced ArrayExpress GUI with support for search and display of NGST data; 7) the ArrayExpressHTS R package for processing NGST data for inclusion in the Gene Expression Atlas; 8) a pipeline to link the ENA short read component, ArrayExpress and Ensembl; and 9) an ontology supporting the description of NGST experiments, and aiding curation.
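The metadata text-search described in item 4 can be illustrated with a minimal sketch. This is not the EBI implementation: the `Record` class, the `search_metadata` function and the example accessions are all hypothetical, and the sketch only shows the idea of returning every record whose metadata layer contains the user's text.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """A hypothetical archive record with a free-text metadata layer."""
    accession: str
    metadata: dict = field(default_factory=dict)


def search_metadata(records, text):
    """Return accessions of records whose metadata values contain `text`.

    Matching is a case-insensitive substring test (an assumption made for
    this sketch; a production service would more likely use an index).
    """
    needle = text.casefold()
    return [
        record.accession
        for record in records
        if any(needle in str(value).casefold() for value in record.metadata.values())
    ]


# Made-up example records; accessions and fields are placeholders.
records = [
    Record("REC0001", {"title": "ChIP-seq of liver tissue", "organism": "Homo sapiens"}),
    Record("REC0002", {"title": "RNA-seq time course", "organism": "Mus musculus"}),
    Record("REC0003", {"title": "Methylation profile", "tissue": "Liver"}),
]

print(search_metadata(records, "liver"))
```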
Before SLING, the richest quantitative proteomics information was poorly provided. After SLING, the PRIDE database has now been enhanced and refined to represent quantitative protein expression data, and to provide both tabular download and external analysis capabilities for such data. Work has been done on a simplified data model and representation for information that supports the main quantification approaches. A new Open Source tool (PRIDE Inspector) has been developed for visualizing and performing an initial assessment of the quality of mass-spec proteomics data. The tool is particularly useful for journal editors and reviewers, as well as facilitating connection to the external Ensembl and Reactome tools. Through SLING, initial steps have been taken to develop PRIDE from a mass-spec specialist resource into a much wider protein expression resource for the molecular biology community in Europe.
Before SLING, dynamic interaction networks were scarcely represented in databases. After SLING, the IntAct database has now been enhanced to allow efficient deposition, curation, display, and analysis of dynamic interaction data, along with integration into a systems biology context. To integrate 3rd party data, a standard interface (PSICQUIC), has been developed (and adopted by all 25 major resource providers of 150 million interactions) to query multiple interaction data resources using the same query. Before SLING, the connection between molecular interaction data and supporting mass spectrometry data was difficult to make, because they were located in two unconnected databases: IntAct (molecular interactions), and PRIDE (mass spectrometry). After SLING, there is now a better connection between the two types of data, allowing the IntAct data to directly access the supporting mass-spec evidence in PRIDE via the existing DAS standard.
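The idea behind a standard query interface such as PSICQUIC, sending the same query to several interaction resources, can be sketched as follows. The base URLs and provider names below are placeholders, and the `/query/<text>` path layout and the example query syntax are assumptions made for illustration; the registry of real services and the PSICQUIC specification should be consulted for actual endpoints.

```python
from urllib.parse import quote

# Hypothetical base URLs standing in for PSICQUIC-compliant services.
SERVICES = {
    "provider_a": "https://psicquic.example.org/provider-a/webservices/current/search",
    "provider_b": "https://psicquic.example.org/provider-b/webservices/current/search",
}


def build_query_urls(query, services=SERVICES):
    """Build one request URL per service from a single query string.

    The same percent-encoded query is appended to every base URL, which is
    the point of a shared interface: one query, many resources.
    """
    encoded = quote(query, safe="")
    return {name: f"{base}/query/{encoded}" for name, base in services.items()}


urls = build_query_urls("brca2 AND species:human")
for name, url in urls.items():
    print(name, url)
```

No network request is made here; the sketch only constructs the URLs, so that fetching and merging the results can be layered on top with whichever HTTP client is preferred.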
Before SLING, the standardisation of protein annotation was limiting inter-database connections. After SLING, a set of protein-naming guidelines has now been developed to facilitate inter-database connections. The naming-standard guidelines have been agreed upon and adopted by a number of the major sequencing centres, data providers, and nomenclature committees. The on-going resolution of existing discrepancies (in protein nomenclature between resources) will facilitate further data integration and interpretation by users. In addition, SLING has contributed to the enhancement of annotation standards of UniProtKB binary protein interactions via a tool developed to conform to the IMEx Consortium MIMIx standard. During SLING, 10 curators have used this tool to curate 2,815 MIMIx-level binary interactions from 1,407 experiments described in 409 publications.
Before SLING, enzyme information lacked connectivity and richness. After SLING, the BRENDA enzyme database has now been substantially enhanced for applications in systems biology and medicine, with emphases on 1) completing the manually-annotated data with full sets of enzyme data by text-mining methods; 2) provision of automated access to the manually-annotated data; and 3) the enlargement of the fields covered by BRENDA (including new enzymes). Text-mining methods have been developed to extract kinetic enzyme data from literature abstracts, to add to the existing manually-extracted data. New output functionality has been developed to allow automatic generation of a single SBML file containing the kinetic data of enzyme-catalyzed reactions of an organism. The widely-used BRENDA tissue ontology standard has been expanded by adding branches and nodes to the tree, and compiling new terms including their definitions; 1,345 new single terms and 1,179 new definitions have been added under SLING. BRENDA now gives access to the PDB enzyme 3D structures and to the UniProt enzyme-specific protein sequences. SLING has also enhanced the naming and classification of enzymes by funding the submission of 336 new EC numbers.
Before SLING, there was poor information for chemicals in biology, and a dependency on proprietary code. During SLING, to understand the needs of multiple user communities and to custom-tailor further ChEBI development, a User Survey was commissioned. Consequent to the survey results, the curation tool was enhanced to link to the CiteXplore text-mining infrastructure to text-mine a given citation; along with the increased new citations, the text-mining enabled addition of extra biological and chemical role data to the ChEBI ontology. Before SLING, ChEBI relied on a number of proprietary chemoinformatics modules (for chemical structure searching, display/editing of 2D chemical structure diagrams) which prevented the free dissemination of the ChEBI technology to the scientific community. Under SLING, ChEBI has been, and is still being, re-engineered to implement Open Source alternatives to all but one of the proprietary libraries, towards the ultimate aim of making ChEBI source code accessible and re-distributable without licensing restrictions. After SLING, the number of monthly unique visitors to the ChEBI website has increased from 15,150 to 24,100; programmatic access to the ChEBI data has increased from a monthly average of 418,000 hits to 3,365,000 hits.
3) Extensive pan-European user-training to facilitate exploitation of the information
Before SLING, the demand for bioinformatics training far exceeded supply. To meet that demand, SLING has now played a crucial role by funding a training programme of 33 Roadshows that have reached c. 1,100 experimental researchers in the molecular life-sciences community throughout Europe, particularly the new EU Member States. The Roadshows have also provided Europe with a lasting legacy of online training courses that have been accessed by 24,000 researchers in the 1st year of operation. In addition, SLING funding has enabled European bioinformatics trainers to meet annually to exchange and develop a comprehensive set of best-practice trainer guidelines.
Project Context and Objectives:
Project summary and context
SLING is concerned with Serving Life-science Information for the Next Generation. Its goal is to make sure that advances in European Science are supported by the best possible biomolecular information, and that European scientists are optimally-equipped to exploit it. To do this, it makes available a comprehensive range of databases and services from the Swiss Institute of Bioinformatics (SIB) and the European Bioinformatics Institute (EBI); it provides exclusive, high-quality training in the use of these databases and resources; and it carries out R&D necessary to enable the data and services to keep pace with changing science. The activities are designed to support both commercial and academic research throughout Europe, and its training will be delivered in numerous European locations. The work to ensure the quality of the data will include efforts targeted at information in patent literature through the European Patent Office (EPO). New high-throughput methods such as next-generation DNA sequencing will provide major stimuli for the R&D work of the project. In addition, the BRENDA enzyme database at the Technical University of Braunschweig (TUBS) will be substantially enhanced for applications in the area of systems biology and medicine (which will also involve Enzymeta).
The main objectives of SLING are to support Europe’s exploitation of biomolecular information in three ways:
• By providing high-quality electronic services allowing access to comprehensive, accurate and up-to-date information
• By a programme of joint research activities which will enhance these services and develop them in response to changing science
• By extensive pan-European user training to facilitate exploitation of the information.
Well-established shared collections of biomolecular data are now a key part of the record of the life-sciences, every bit as important as the journals. These databases are now substantial operations and are crucial to the SLING project. The SLING partners are the main European players, and this role is typically discharged as the European partner of a global collaboration which ensures that data are exchanged world-wide, ensuring completeness for European scientists.
Advances in high-throughput methods have increased both the scale and complexity of the task of serving this information. SLING addresses these challenges through:
• Service provision which will allow access to a comprehensive and unified collection of electronic resources of relevance to biology.
• Joint research activities which will enhance those information resources in response to the demands of developments in high-throughput science.
• User training which will be delivered throughout Europe and will optimize European researchers’ ability to exploit the service and create enduring interactions which reinforce pan-European collaboration.
The completion of the SLING project will result in very visible progress:
• Bioinformatics Training - several hundred scientists will benefit from SLING training
• Exploitation of Patent Literature - services and training will exist to render patent literature accessible
• Next Generation Sequence Data - will be a normal part of the public databases
• Epigenomics - will be fully integrated with genome information
• Sequence based transcriptomics - will be a normal part of the public databases
• Quantitative Proteomics - will be a normal part of the public databases
• Molecular interactions - databases will include dynamic information
• Protein annotation - terminological standards will be enhanced, and database connectivity improved
• Enzyme information - increased connections to tissue and structure information; new enzyme classes and kinetic data will be created
• Chemicals in biology - enhanced databases and open source chemoinformatics code will be created
To achieve the project goals (i.e. to provide information and information services, to carry out R&D to ensure that those services are state-of-the-art, and to optimise scientific exploitation of those services through extensive user training), the project embraces four groups of work packages:
WP01 — Overall management of the project
WP02 — Training in use of the resources
WP03 — Dissemination of best practice for patent information
WP04 and WP05 — Serving users (SIB and EBI are the only partners claiming costs for this, though the other partners do serve information)
WP06 to WP14 — Research and development work enhancing the services
These workpackages were originally scheduled to cover a 36-month period (i.e. two 18-month periods); subsequently, this period was increased (by a no-cost extension) to 42 months (one 18-month period, one 14-month period, and one 10-month period).
Description of the main S & T results/foregrounds:
Bioinformatics data resources are only useful if their intended users know how to use them. SLING WP02 has therefore played a crucial role in making the deliverables of the project’s technical workpackages accessible to Europe’s molecular life sciences community.
We have developed and refined a flexible training programme for users of the services delivered under SLING. Termed the Bioinformatics Roadshow, this programme comprised a set of short training modules (http://www.ebi.ac.uk/training/roadshow/modules) which can be combined to meet the needs of different user communities. SLING funding has enabled us to deliver the Roadshow to 33 different host institutes throughout Europe, reaching 1,066 individuals (See attached file 'SLING Roadshows - Trainer & attendee info'). The training is aimed at experimental researchers with no previous bioinformatics training, and incorporates a high proportion (typically >50%) of ‘hands-on’ practical training at the computer.
We collected feedback from our trainees, hosts and trainers, including key performance indicators. In total, 64% of our delegates completed a feedback questionnaire after their Roadshow. Of these, the percentage of delegates who had used SLING resources before their Roadshow ranged from 0% (Sardinia) to 100% (Aosta), with a mean of 52%. This gives us confidence that we have succeeded in reaching research communities that were not previously heavy users of the SLING resources. The percentage of delegates who informed us that they would use the SLING resources after their Roadshow was 94%, showing that our Roadshows effectively promoted the SLING resources to previously SLING-naïve researchers.
(See attached file 'SLING Roadshows - Trainer & attendee info')
Importantly, we targeted institutes that would not have been able to host a Roadshow without the support of SLING, especially in the new EU member states.
The training materials used for these Roadshows have been developed into Train online (www.ebi.ac.uk/training/online), a freely-available online training resource aimed at end-users of SLING services. It builds on the training materials developed by SLING trainers to provide stand-alone courses that can either be used as a complement to face-to-face training, or be accessed completely independently by users in their own time. In its first year of operation, Train online has had 24,000 users.
WP02 also enabled bioinformatics trainers from around Europe to meet on an annual basis and exchange best practice; the outcome of these meetings is a comprehensive set of best practice guidelines for bioinformatics trainers. ‘Bioinformatics training for life scientists: guidelines for best practice’, with contributions from more than 20 people, derives from the shared experiences of a group that has met several times over the last three years through the SLING trainer networking sessions. This document is aimed at anyone delivering or organising bioinformatics training for life scientists. Its goal is to make it easier for them to deliver excellent training, and thus to expedite research in the life sciences and biomedical sciences.
Several training collaborations also came out of the trainer networking, including a textbook, several papers, and the development and delivery of new training courses. Finally, the SLING partners produced and maintained extensive on-line documentation, often based on enhancements to their existing documentation.
Now that SLING has drawn to an end, although we do not have funds to run a comprehensive programme of roadshows throughout Europe, we will continue to train our users throughout the world, on a host-funded basis (see www.ebi.ac.uk/training/roadshows for a list of scheduled roadshows). The SLING Roadshow programme has provided us with a lasting legacy in the shape of online courses delivered through Train online. Furthermore, it has helped us to develop a network that will help to shape the ELIXIR training programme in the future.
PATENT DATA COLLECTION AND DISSEMINATION
The European Patent Office (EPO, http://www.epo.org) offers inventors a uniform application procedure which enables them to seek patent protection in up to 40 European countries. Supervised by the Administrative Council, the Office is the executive arm of the European Patent Organisation. The main task of the European Patent Office is to examine patent applications and to grant European patents. Besides the granting procedure, disclosure of the invention and subsequent publication are fundamental to the European patent system.
A number of patents disclose nucleic acid and protein sequences. These biological sequences are unusual within patents in that they contain information presented in a structured manner, compared to the other unstructured parts of a patent application. Initiated under the EU FELICS FP6 programme and being continued under the FP7 SLING project, the EPO pursued developments to improve both quality and quantity of available sequence information. In collaboration with the EBI, progress has been made towards extraction of chemical information and cross-referencing to scientific literature. Those projects are part of WP03 and WP13.
WP03 has the following three objectives:
i. The dissemination of IP awareness in Biotechnology, more specifically, the availability of free sources of information around patents in life sciences and beyond. The priority is to target countries with low awareness, and low usage of the databases and services covered by SLING
ii. The development and the promotion of a biological sequence submission tool (BiSSAP)
iii. To populate biological sequence databases containing sequences disclosed in patents
1. Dissemination activities
The EPO has organized 34 dissemination activities:
• 27 EPO-SLING awareness events have been held, covering a core of presentations on the SLING Project, Lisbon Agenda, free patent sources of information in Biotechnology, Patent Sequence Databases, BiSSAP, and EBI Services.
Some events were also attended by other SLING partners (SIB and Enzymeta) to present the Swiss-Prot and BRENDA databases. Finally, additional talks were given, either relevant to the particular country hosting the event or on specialized topics at the request of the hosts. To facilitate the organization and availability of material presented, the registrations and presentations are posted on a dedicated web site (http://www.sling-diffusion.eu). According to the replies to our questionnaires, impact and added value of these events reached the target (overall score around 8/10). See file 'Patent Data Collection and Dissemination - S&T Figures' Figure 1.
• five EPO Online Services User Days have been held in Madrid. During this event, we had a workshop dedicated to BiSSAP. The audience consisted mainly of Patent Attorneys who may use this software to provide biological sequence data according to EPO requirements.
• two Poster sessions for the presentation of the non-redundant patent sequence database and BiSSAP
• only one event had to be cancelled at the last moment
The EPO made its best effort to recover from the initial project delays and, in response to the EU's request to focus on the 'new member states', limited its activities to European countries.
A typical dissemination event consisted of dedicated presentations and workshops (Questions and answers). Each event consisted of talks around a core of presentations focusing on:
• Patent awareness in the country visited, often with a talk given by a representative of the nation's patent office
• free scientific patent related information in Biotechnology including BiSSAP
• EBI, UniProt, BRENDA and other freely-available sources of information
In addition, ad-hoc talks around innovation and/or patents in Biotechnology have been given by the local hosts.
In total, more than 1,500 participants registered for the EPO-SLING awareness events, and around 1,000 attended (EPO online events attracted an average of 10 participants per session).
2. Biological sequence submission tool (BiSSAP)
BiSSAP (Biological Sequence Submission Application) is a standalone submission tool to enable the preparation and filing of DNA/protein sequences to a patent office according to the prescribed standard (WIPO ST.25). BiSSAP also supports a new standard being proposed to the World Intellectual Property Organization, which reproduces INSDC XML. Progress towards the adoption of this standard is partly due to our submission software.
With respect to the new standard proposal (ST.26), in 2010 the EPO received the mandate to prepare a recommendation of said standard for adoption as a WIPO standard. A status report was presented during an ad-hoc session of the Committee on WIPO Standards where the Task Force presented the proposal to the user community for comments. The next steps are as follows (http://tinyurl.com/dxdjkrt):
• Sep to Nov, 2012 - last round of consultations on WIPO's wiki
• Early 2013 - adoption of the draft standard at 3rd session of the Committee on WIPO Standards (CWS/3)
It was initially foreseen to also have a web-based service. However, the option of deploying a web-based submission tool to the applicant created concerns. Therefore the EPO implemented the expert rules for validation within the BiSSAP import and verification module. This solution maintains the functionality initially considered, yet does not generate traffic over the internet before the patent application is filed.
BiSSAP developments continue beyond the scope of SLING. BiSSAP 1.2 was released to the public at the beginning of September 2012 (see www.epo.org/bissap). Among the main features, the software:
• can create a sequence listing from scratch
• allows existing sequence listing(s) to be imported and, where needed, modified or amended
• enables the import of flat file sequences and annotations from EMBL entries/FASTA formatted sequences
• has built-in expertise to help the user and circumvent common mistakes
• allows only IUPAC characters
• allows only a combination of valid feature keys/qualifiers as prescribed by INSDC (International Nucleotide Sequence Database Collaboration)/UniProt
• provides assistance with respect to the source (Updated NCBI Taxonomy database) of the sequences
• allocates the conceptual translation as prescribed by the EMBL/EBI, e.g. according to the sequence source
• provides real-time verification of an ongoing project by visual means and reporting
• has a batch verification engine, allowing the verification of multiple sequence listings in one operation
• converts between the existing standard WIPO ST.25 and the in-progress proposal WIPO ST.26
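Three of the features listed above (FASTA import, the restriction to IUPAC characters, and batch verification) can be sketched in a few lines. This is an illustration only and does not reproduce BiSSAP's actual rules or code: the function names are hypothetical, only nucleotide sequences are handled, and the character set used is the standard IUPAC nucleotide alphabet including ambiguity codes.

```python
# IUPAC nucleotide codes: bases plus ambiguity symbols.
IUPAC_NUCLEOTIDES = set("ACGTURYSWKMBDHVN")


def parse_fasta(text):
    """Parse FASTA-formatted text into a {header: sequence} dict."""
    sequences = {}
    header = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            sequences[header] = ""
        elif header is not None:
            sequences[header] += line
    return sequences


def invalid_positions(sequence):
    """Return (1-based position, character) pairs that are not IUPAC codes."""
    return [
        (position, char)
        for position, char in enumerate(sequence, start=1)
        if char.upper() not in IUPAC_NUCLEOTIDES
    ]


def verify_batch(listings):
    """Verify several FASTA texts in one call.

    Returns {listing name: {sequence header: [invalid positions]}}; a listing
    with no problems maps to an empty dict.
    """
    report = {}
    for name, text in listings.items():
        errors = {}
        for header, sequence in parse_fasta(text).items():
            bad = invalid_positions(sequence)
            if bad:
                errors[header] = bad
        report[name] = errors
    return report


# Made-up listings: seq2 contains an 'X', which is not an IUPAC nucleotide code.
listings = {
    "listing_1": ">seq1\nACGTNRY\n>seq2\nACGXT\n",
    "listing_2": ">seq3\nacgu\n",
}
print(verify_batch(listings))
```

Reporting the position of each offending character, rather than simply rejecting the sequence, mirrors the aim of helping the user correct a listing before filing.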
3. Patent sequence database growth
To maintain a sustainable value-chain from acquisition to the publication of biological sequences disclosed in published patent documents, the EPO undertook a major re-engineering of several components within its workflow. Central to this was the complete redesign of the master database for biological sequence information. The BiSSAP verification engine is now used to verify incoming sequence data. With this in place, from November 2011 onwards, the EPO further adapted verification rules, and put a new pipeline in production to release sequence data filed as part of patent applications to the public. Besides the normal production, the new service made it possible to verify, store and publish data from 'mega applications'.
Around 28 million sequences underwent verification and storage for subsequent publication and distribution. Finally, cross-referencing to claimed sequences was added as part of annotations. From March 2009 until August 2012, the number of patent sequence entries in EMBL grew from 8 million to 24 million.
EBI DATA SERVICES
Data Services represent c. 75% of the EBI activities, and involve collecting, organizing and making available a wide range of biomolecular data, and providing services allowing scientists to exploit those data (which include genes and genomes, protein sequences and structures, molecular behaviour and interactions). As well as users in the fields of human biology, medical research and drug discovery, and domains as diverse as timber production, agriculture, fisheries and nutrition, new users are emerging in the areas of personal care products, medical device research and biofuels. Under the SLING project, the EBI aims to give access for research communities to all of these data via the worldwide web, for download using ftp and for programmatic access through application programming interfaces.
Access to EBI resources has been consistently metered using a real-time log-file analyzer to show web server usage patterns. During months 1 to 42 of the SLING project, the daily average web hits has increased from c. 2.5 million (before SLING) to c. 6.3 million (SLING P3, excluding Ensembl): an increase of 152%; and from c. 4.1 million (before SLING) to c. 6.9 million (SLING P3, including Ensembl): an increase of 112%. The monthly average unique users has increased from c. 204,000 (before SLING) to c. 312,000 (SLING P3): an increase of 53%. The monthly average of programmatic jobs has increased from c. 308,000 (before SLING) to c. 3.1 million (SLING P3): an increase of 900%.
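The percentage increases quoted here follow the usual formula, (after - before) / before × 100. The short sketch below reproduces two of the figures from the rounded values given in the text; the function name is ours, and because the inputs are approximate ("c.") values, the results are approximate too.

```python
def percent_increase(before, after):
    """Percentage change from `before` to `after`."""
    if before == 0:
        # Resources that did not exist before SLING have no meaningful
        # percentage increase, so the zero baseline is rejected explicitly.
        raise ValueError("percentage increase is undefined for a zero baseline")
    return (after - before) / before * 100.0


# Daily average web hits, excluding Ensembl: c. 2.5 million to c. 6.3 million.
print(round(percent_increase(2.5e6, 6.3e6)))      # 152

# Monthly average unique users: c. 204,000 to c. 312,000.
print(round(percent_increase(204_000, 312_000)))  # 53
```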
For individual EBI-based resources:
• ArrayExpress - the average monthly unique visitors has increased from c. 7,000 (before SLING) to c. 10,500 (SLING P3); an increase of 50%. The average monthly web hits has decreased from c. 1.57 million (before SLING) to c. 1.36 million (SLING P3); a decrease of 13%.
• Bioinformatics Training Network - the average monthly unique visitors has increased from zero (i.e. BTN did not exist before SLING) to c. 800 (SLING P3). The average monthly web hits has increased from zero (i.e. BTN did not exist before SLING) to c. 21,000 (SLING P3).
• ChEBI - the average monthly unique visitors has increased from c. 8,800 (before SLING) to c. 23,100 (SLING P3); an increase of 162%. The average monthly web hits has increased from c. 0.46 million (before SLING) to c. 4.00 million (SLING P3); a massive increase.
• EBI Train Online - the average monthly unique visitors has increased from zero (i.e. Train Online did not exist before SLING) to c. 2,200 (SLING P3). The average monthly web hits has increased from zero (i.e. Train Online did not exist before SLING) to c. 153,000 (SLING P3).
• ENA - the average monthly unique visitors has increased from zero (i.e. ENA did not exist before SLING) to c. 21,000 (SLING P3). The average monthly web hits has increased from zero (i.e. ENA did not exist before SLING) to c. 8.4 million (SLING P3).
• IntAct - the average monthly unique visitors has increased from c. 3,500 (before SLING) to c. 5,300 (SLING P3); an increase of 51%. The average monthly web hits has increased from c. 0.33 million (before SLING) to c. 5.00 million (SLING P3); a massive increase.
• Microarray Gene Expression Atlas - the average monthly unique visitors has increased from c. 436 (before SLING) to c. 16,500 (SLING P3); a massive increase. The average monthly web hits has increased from c. 0.01 million (before SLING) to c. 1.3 million (SLING P3); a massive increase.
• PRIDE - the average monthly unique visitors has increased from c. 1,800 (before SLING) to c. 2,800 (SLING P3); an increase of 56%. The average monthly web hits has increased from c. 0.08 million (before SLING) to c. 0.3 million (SLING P3); an increase of 275%.
• PSICQUIC - the average monthly unique visitors has increased from zero (i.e. PSICQUIC did not exist before SLING) to c. 1,200 (SLING P3). The average monthly web hits has increased from zero (i.e. PSICQUIC did not exist before SLING) to c. 5.2 million (SLING P3).
• SRA - the average monthly unique visitors has increased from c. 20 (before SLING) to c. 334 (SLING P3); an increase of 19%. The average monthly web hits has increased from a few hundred (before SLING) to c. 185 million (SLING P3); a massive increase.
SIB DATA SERVICES
The objective of this work package was the provision of data services in the form of protein annotation by the Swiss-Prot group of the SIB, within the framework of the UniProtKB/Swiss-Prot protein knowledgebase (http://www.uniprot.org/), a high-quality curated resource of protein sequences and functional annotation. This work has been carried out in close collaboration with the UniProt team of the PANDA group at the European Bioinformatics Institute (EBI). During the course of SLING, more than 124,000 curated sequence records were added to UniProtKB/Swiss-Prot.
The workflow for the manual curation of UniProtKB/Swiss-Prot records is described in our online documentation (see http://www.uniprot.org/help/biocuration and www.uniprot.org/docs/sop_manual_curation.pdf). Within the context of SLING we place a particular emphasis not only on the curation of functional information from experimental literature but also on the manual curation of the protein sequences of UniProtKB/Swiss-Prot records. Sequence curation involves the identification and annotation of biologically-relevant sequence differences, such as alternative splicing events and natural variations, as well as the resolution of sequence discrepancies arising due to sequencing errors and erroneous gene model predictions; such discrepancies are not uncommon. The majority of UniProtKB/Swiss-Prot sequences are based on translations of nucleotide submissions to the International Nucleotide Sequence Database Collaboration, INSDC, which includes the European Nucleotide Archive (ENA), GenBank, and the DNA Data Bank of Japan (DDBJ). These sequence submissions may vary significantly in quality, but the archival nature of INSDC means that they cannot be altered or corrected at source. During manual curation of UniProtKB/Swiss-Prot records, corrected sequences from INSDC are annotated with a flag that indicates when a sequence correction has been performed, and that also specifies the type of correction that was required (such as the rectification of an erroneous gene model or incorrect initiation codon). During the course of SLING, over 7,800 such corrections were made to the INSDC sequences of both new and existing UniProtKB/Swiss-Prot records. The manual analysis and correction of erroneous sequences that is performed by Swiss-Prot curators is essential to maintain the high quality and accuracy of UniProtKB/Swiss-Prot sequences, and also has a beneficial effect on those downstream resources and applications that depend on UniProtKB sequences. 
These include genome annotation pipelines such as Ensembl, providers of sequence signatures for the classification and annotation of protein sequences such as Pfam, PROSITE, and HAMAP, and community efforts to develop and benchmark phylogenetic methods by the Quest for Orthologs (QfO) consortium. The sequence corrections proposed by Swiss-Prot curators are also propagated to other sequence resources: information about sequence corrections and updates is communicated directly to other resources such as Ensembl and HAVANA (http://www.sanger.ac.uk/research/projects/vertebrategenome/havana/), and Swiss-Prot curators also contribute actively to the Consensus CoDing Sequence (CCDS) project, which aims to establish a consensus CDS for each gene in human and mouse. These activities ensure that the sequence curation performed in UniProtKB/Swiss-Prot enhances the quality of other sequence resources, providing the maximum benefit for the life sciences community.
Following sequence curation each UniProtKB/Swiss-Prot entry is enriched with functional annotation derived from experimental information published in peer-reviewed journal articles. This annotation includes protein and gene names as well as functional information including catalytic activity and cofactor requirements, pathway membership, protein interactions, and functionally important sites, such as catalytic residues and post-translational modifications. The usefulness of UniProtKB/Swiss-Prot annotations is determined not only by their accuracy and completeness, but also by the ease with which this information can be integrated with that from other resources. For this reason we have invested significant resources in the standardization of UniProtKB/Swiss-Prot annotation, including the development and application of curation guidelines, standards, and ontologies for the curation of biological knowledge. Work on the standardization of protein nomenclature between UniProtKB and other resources such as RefSeq and INSDC, as well as the adoption of MIMIx as a standard for the curation of protein interaction data by Swiss-Prot, is described in WP10. Other developments specifically discussed here relate to the annotation of functional information using the Gene Ontology (GO), the annotation of enzyme functions using the Enzyme Classification (EC) of the IUBMB, the annotation of post translational modifications (PTMs) using our own controlled vocabularies, and the annotation of information from protein structures.
During the course of SLING, we significantly expanded our annotation efforts using GO, producing over 85,000 manual annotations. GO is a de facto standard for the representation of knowledge on biological functions and processes that is used by all the major resources of curated information for experimental model organisms including SGD, TAIR, WormBase and FlyBase. To complement these resources most effectively, we focus our GO annotation efforts on sequences from taxonomic groups that are not covered by these resources, notably human proteins, which account for around 40% of our annotation output. GO annotations are made available both through the Gene Ontology Annotation database GOA at the EBI and also within UniProtKB records.
In addition to developing our GO annotation efforts we also focused on the standardization of annotation for enzymatic functions, obtaining membership of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB), and continuing to curate the official Enzyme Nomenclature (EC classification) defined by this body. In total, we curated 724 new complete EC numbers in UniProtKB/Swiss-Prot during SLING (of 776 new EC numbers created by the NC-IUBMB). The EC classification is a common vocabulary for all curation efforts that gather and provide data on enzymatic reactions including KEGG and MetaCyc, as well as efforts to model and predict cellular metabolic behaviour. By actively contributing to the development of the EC nomenclature and consistently applying it in our own annotation, we enhance the utility of UniProtKB/Swiss-Prot and its integration with other knowledgebases as well as resources of metabolic models. To provide enhanced functional context for annotated enzymatic activities we have also developed and apply our own controlled vocabulary for enzymatic pathways in UniProtKB/Swiss-Prot, UniPathway.
Post-translational modifications (PTMs) can dramatically alter the structure of proteins, influencing or altering their functions and interactions, and new post-translational modifications are being discovered on a regular basis. We continued to expand the controlled vocabulary for PTMs used in UniProtKB, creating 85 new terms during the course of SLING. Our PTM vocabulary is fully-mapped to the PSI-MOD ontology, facilitating its application in the annotation of results from mass-spectrometry experiments. We also continued to apply our controlled vocabulary for PTMs to UniProtKB/Swiss-Prot, curating over 100,000 individual PTMs and associated information into relevant UniProtKB/Swiss-Prot entries. In related work, we have developed a procedure for the annotation of information from high-throughput mass spectrometry studies which leverages published experimental metadata for the annotation of unique peptides in UniProtKB proteomes. This procedure will facilitate the integration of PTM data from PRIDE and other member databases of the ProteomeXchange consortium, and will reduce the accumulation of false positive annotations that occurs when multiple datasets are combined ad hoc.
Protein structures constitute a rich source of information on protein function and evolution, including the roles of specific sites and the modifications that affect them. During the course of SLING, we continued to curate structural information from publications, curating over 5,000 references describing protein structures and more than 40,000 cross-references to the PDB resource in more than 4,000 UniProtKB/Swiss-Prot records.
All UniProtKB sequence data and functional annotations are made available in monthly releases through the UniProt website (http://www.uniprot.org) for query or download in a variety of formats including FASTA, text, XML, and RDF/XML, a recognized W3C standard for publication to the Semantic Web. The encoding of annotations in RDF/XML also forms the basis for methods to ensure data consistency using rules and constraints based on the RDF query language SPARQL and SPARQL Inferencing Notation (SPIN). All the data that we provide are accompanied by detailed documentation. To help users get the most out of UniProt, we have also participated in a number of SLING Roadshows targeting experimental researchers from areas of Europe not previously served by these types of activities in FELICS. We have also helped establish common teaching materials, guidelines, and best practices for the teaching of bioinformatics as founding members of the Bioinformatics Training Network (BTN). These activities are described in WP02.
NEXT GENERATION SEQUENCE DATA STORAGE AND INTERPRETATION
SLING WP06, 'Next generation sequence data storage and interpretation', has achieved its two stated objectives: 1) we have collected, integrated and organised next generation sequence data, and 2) we have developed (for epigenomics data) an additional layer of analysis along with the ability to retrieve raw supporting next generation sequence data for asserted epigenomic features.
The major foreground achievements are as follows:
• The Epigenomics Checklist: a checklist of minimal information that we expect from data submitters to the EBI when describing raw data sets from next generation sequencing platforms used in high-throughput studies of epigenetic features (http://www.ebi.ac.uk/ena/about/epigenomics_submissions). This checklist has been developed in order to assist practically those preparing their data for submission to EBI, and those developing tools to assist in submission processes. Providing information fields to be reported that are mandatory, recommended and optional, it balances the need for richness of information (to make the data useful downstream) and simplicity of reporting (to make it easy for the data provider to comply with the standard). Finally, it presents mappings into SRA XML elements in order that the information in these fields can be systematically represented by data providers.
• Data exchange between NCBI and EBI: Background provided by EBI at the start of SLING included a simple next generation sequence metadata (but not data) mirroring process. Under SLING, we have 1) made major improvements to the automation and robustness of this process, 2) made it a two-way metadata exchange pipeline, and 3) implemented a full data exchange pipeline. Under the well-established ‘informed pull’ data exchange model, each of the exchange partners maintains a list that is exposed to the other partner, showing accessions for new and updated data in the local resource. Regular automated access to this list triggers the transfer of the listed data from the remote to the local site and subsequent loading into local databases.
• Submission tools: For data submitters who route their data directly to EBI (as opposed to exchange partners), we took as background a prototype for a RESTful/drop-box submission service. SLING foreground includes the roll-out of this into full production, along with significant improvements that enable full automation of submissions. Functions include automated submission receipt and status reporting (with two iterations of these technologies provided under SLING) and automated re-launch of loading jobs.
• Metadata text-search and link to API: A text-search function has been provided under SLING that allows users to supply text that may appear in the metadata layer (within indexed fields in SRA XML objects) and returns all records in which the text is present. This builds on the background of the EBI search, but adds significant foreground in the form of a wrapper that routes users directly to records when known accessions are provided as the query, and that supports trace identifier look-up where, due to the large number of records, indexing under EBI search is not possible. The results of these searches are presented as SRA metadata (and other) objects with links to views of the records. The rendering of these views has been enhanced under SLING: datafile paths are provided as links and can be consumed programmatically in the form of machine-readable tabular presentations.
• The Ensembl ‘Regulatory Build’ methodology: An analysis pipeline has been developed to provide a way to annotate regulatory features in a cell-type or sample-specific manner, giving high-level evidence-based interpretation of the regulatory status of the genome. The underlying infrastructure of the tracking database delivers a system by which input data can be uniformly managed and accessed from the ENA, which provides a consistent input to the Regulatory Build analysis pipeline. The Ensembl ‘funcgen’ API and analysis pipeline are publicly available, and data access is also provided via direct SQL, the Ensembl browser visualizations, or via a specialised Ensembl ‘Regulation’ mart (i.e. a web-based data-mining tool).
• An Ensembl dedicated view: A dedicated web-view has been developed which provides a portal between Ensembl and the sequence archives which host the supporting data for the Ensembl ‘Regulatory Build’. Access is provided via species-specific links from the Ensembl online documentation, or via specific ‘Source’ links provided by pop-up menus, available by clicking on one of the supporting evidence features.
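The ‘informed pull’ exchange model described in the data exchange item above can be sketched in a few lines. This is a minimal illustration only, not the EBI or NCBI implementation: the Archive class, the integer version numbers and the accessions are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Archive:
    """Toy stand-in for one exchange partner (hypothetical, for illustration)."""
    name: str
    # accession -> (version, payload); integer versions are an assumption
    records: dict = field(default_factory=dict)

    def exchange_list(self):
        """The list exposed to the partner: accession -> current version."""
        return {acc: version for acc, (version, _) in self.records.items()}

    def fetch(self, accession):
        """Return the full record for one accession."""
        return self.records[accession]


def informed_pull(local, remote):
    """Pull every record that is new or updated at the remote site."""
    pulled = []
    for accession, remote_version in sorted(remote.exchange_list().items()):
        local_version = local.records.get(accession, (0, None))[0]
        if remote_version > local_version:
            # Transfer from the remote site and load into the local store.
            local.records[accession] = remote.fetch(accession)
            pulled.append(accession)
    return pulled


ebi = Archive("EBI", {"ERR000001": (1, "reads-a")})
ncbi = Archive("NCBI", {"SRR000001": (2, "reads-b"), "ERR000001": (1, "reads-a")})

print(informed_pull(ebi, ncbi))   # accessions newly loaded at EBI
print(informed_pull(ncbi, ebi))   # nothing newer at EBI, so nothing is pulled
```

Each partner only reads the other's list and decides locally what to transfer, which is what makes the exchange symmetric when it runs in both directions.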
NEW GENERATION SEQUENCING FOR TRANSCRIPTOMICS
SLING WP7 has provided the infrastructure, methods, and curatorial expertise to acquire, annotate, integrate and disseminate Next Generation Sequencing based transcriptomics (NGST) data across the EBI, between SLING partners, in Europe and internationally. The main highlights are:
1. A community-generated and adopted format for the content of next generation sequencing-based transcriptomics data – Minimum Requirements for Next Generation Sequencing data – MINSEQE. The MINSEQE standard is a continuation of the well-known MIAME standard for microarray data, and was developed by SLING partners in collaboration with the international Functional Genomics Data Society. The community-adopted format MAGE-TAB was extended to deal with NGST data.
2. A data-exchange agreement between the world's major repositories, the USA NCBI Gene Expression Omnibus (GEO) and the European ArrayExpress. Both these databases collect and store metadata describing transcriptomics and other functional genomics NGS experiments and the related processed data. The raw data are stored in the Short Read Archive at NCBI and EBI, and exchanged independently. Exchanging the experimental metadata and processed data thus closes the data exchange circle.
3. Open source software supporting the data exchange agreement with conversion scripts and curator support tools. This is important because, for historic reasons, GEO and ArrayExpress developed different formats. Moreover, at ArrayExpress we need to annotate the data using the Experimental Factor Ontology to achieve consistency within EBI.
4. An open source, spreadsheet-based submission tool for biologists and bioinformaticians submitting NGST data – MAGETabulator. MAGETabulator was developed to generate MAGE-TAB templates for NGST data.
5. An online training resource supporting the submission format, with help pages and user support from biological experts. An e-learning portal describing the data submission was developed.
6. In-person training courses, with supporting tutorial material, were developed.
7. A new version of the ArrayExpress graphical user interface with support for search and display of the NGST data was developed. An important feature of this interface is an integrated query and display mechanism that seamlessly combines NGST and microarray-based transcriptomics data queries.
8. The ArrayExpressHTS R package was developed to process NGST data for inclusion in the Gene Expression Atlas. This package was included in the popular Bioconductor set of packages.
9. Links between the ENA short read component, ArrayExpress and Ensembl have been established. A robust pipeline was developed to channel the raw NGST data via ArrayExpress to ENA, recording and storing the respective metadata in ArrayExpress. For a high-volume NGS data submission pipeline from the Wellcome Trust Sanger Institute to ENA, a reverse pipeline was implemented, storing the raw data in ENA, and passing the processed transcriptomics data and experimental metadata to ArrayExpress.
10. An ontology containing terms supporting the description of NGST experiments used in submission tools, curation processes and the ArrayExpress GUI was developed and used in data curation.
11. A number of papers describing these developments were published in major journals (see references)
1. Kapushesky M, Emam I, Holloway E, Kurnosov P, Zorin A, Malone J, Rustici G, Williams E, Parkinson H, Brazma A. Gene expression atlas at the European bioinformatics institute. Nucleic Acids Res. 2010 Jan;38(Database issue):D690-8. Epub 2009 Nov 11
2. Shankar R, Parkinson H, Burdett T, Hastings E, Liu J, Miller M, Srinivasa R, White J, Brazma A, Sherlock G, Stoeckert CJ Jr, Ball CA. Annotare--a tool for annotating high-throughput biomedical investigations and resulting data. Bioinformatics. 2010 Oct 1;26(19):2470-1. Epub 2010 Aug 23
3. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Holloway E, Kurbatova N, Lukk M, Malone J, Mani R, Pilicheva E, Rustici G, Sharma A, Williams E, Adamusiak T, Brandizi M, Sklyar N, Brazma A. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res. 2011 Jan;39(Database issue):D1002-4
4. Xue V, Burdett T, Lukk M, Taylor J, Brazma A, Parkinson H. MageComet--web application for harmonizing existing large-scale experiment descriptions. Bioinformatics. 2012 May 15;28(10):1402-3
5. Kapushesky M, Adamusiak T, Burdett T, Culhane A, Farne A, Filippov A, Holloway E, Klebanov A, Kryvych N, Kurbatova N, Kurnosov P, Malone J, Melnichuk O, Petryszak R, Pultsin N, Rustici G, Tikhonov A, Travillian RS, Williams E, Zorin A, Parkinson H, Brazma A. Gene Expression Atlas update--a value-added database of microarray and sequencing-based functional genomics experiments. Nucleic Acids Res. 2012 Jan;40(Database issue):D1077-81
6. Goncalves A, Tikhonov A, Brazma A, Kapushesky M. A pipeline for RNA-seq data processing and quality assessment. Bioinformatics. 2011 Mar 15;27(6):867-9
7. Rustici G, Kolesnikov N, Brandizi M, Burdett T, Dylag M, Emam I, Farne A, Hastings E, Ison J, Keays M, Kurbatova N, Malone J, Mani R, Mupo A, Pereira RP, Pilicheva E, Rung J, Sharma A, Tang A, Ternent T, Tikhonov A, Welter D, Williams E, Brazma A, Parkinson H, Sarkans U. ArrayExpress update--trends in database growth and links to data analysis tools. Accepted in NAR Database Issue 2013
Over the course of SLING, we have developed and refined the representation of quantification data in PRIDE (based on our own experience and on the feedback received) (D08.01 'Updated PRIDE database and XML schemata to allow storage and dissemination of quantitative proteomics data').
In parallel, we have been working on a simplified data model and representation for quantification information (implemented also in the new data format, mzTab, currently under internal PSI review). The framework is now mature enough to support the main quantification approaches. The techniques currently supported are Total Ion Count (TIC), emPAI, SILAC, ICAT, ICPL, iTRAQ, TMT, O18, and AQUA® (absolute quantification). The PRIDE controlled vocabulary has been extended accordingly to include new terms related to quantification techniques (http://www.ebi.ac.uk/ontology-lookup/browse.do?ontName=PRIDE, see the new ‘Quantification parameter’ branch).
In February 2011, we first released a new tool called PRIDE Inspector, an open-source, stand-alone Java application for visualizing and performing an initial assessment of the quality of MS proteomics data. A Java Web Start version of PRIDE Inspector is also accessible from the PRIDE web interface (http://www.ebi.ac.uk/pride). With PRIDE Inspector, researchers can examine their own data sets before the actual submission to PRIDE is performed, or access data already in PRIDE for data-mining purposes via a public PRIDE MySQL instance. Importantly, it can also be used by journal editors and reviewers, since it facilitates the thorough review of submitted data at the pre-publication stage. Currently, it supports fast loading of popular file formats such as PRIDE XML and mzML (the community standard for MS data) and, as indicated above, it gives direct access to a PRIDE MySQL database instance.
PRIDE Inspector provides different views, with each view focusing on a different aspect of the data: experimental details, spectrum, protein, peptide, quantification and ‘summary charts’. A major strength of this tool lies in the possibility to perform a first assessment on data quality, since a variety of simple charts based on the data are generated automatically.
Since it was originally released, PRIDE Inspector has become the major PRIDE access tool. We have implemented a new module in the Inspector, the quantification view. This view is now the general starting point for links to external analysis tools, currently Ensembl chromosome maps and Reactome, and likely additional ones in the future. The quantification view (See file 'Quantitative Proteomics - S&T Figures' Figure 1) focuses on facilitating analysis of reported quantification values between proteins and peptides and, importantly, it refers to the relevant experiment metadata and spectra.
The quantification view consists of five main components: the sample metadata component, the protein quantification component, the peptide quantification component, the protein quantification histogram and a spectrum viewer. The sample metadata component shows a list of all the reagents used in the experiment, and also highlights their related sample metadata, such as: species, tissue, cell line, GO term and sample description.
The protein quantification component is designed to be the main focus of the quantification view. At the top of the component, a summary of the quantitative methods used is displayed. For each protein, all the ratios are shown based on the control sample selected. A bar chart is also displayed alongside the values, to highlight whether that particular protein is up- or down-regulated with respect to the control sample. Three additional buttons are also included. The “protein details” button retrieves protein details remotely, such as protein names, protein accession status and sequence coverage. The “Set Control Sample” button lets users set a different control quantification reagent. Finally, the “Export” option can be used to filter the quantification results and export them to a tab-delimited file, or link the filtered data directly to the Ensembl chromosome view or the Reactome pathway view.
Mapping protein quantitative values to Reactome pathways: quantitative proteomics data can be mapped to Reactome pathways using the “Export” button in the protein quantification component. Clicking on the button will show an export dialog (See file 'Quantitative Proteomics - S&T Figures' Figure 2). There, a table of protein identifications will be displayed. Users can further restrict the chosen proteins using the filter options in the bottom section of the dialog, selecting proteins by their level of expression (as a percentage of up- or down-regulation with respect to the control sample). Users can also choose not to filter the initial list provided. Then, clicking on the "Reactome Pathway" button will initiate the Reactome analysis.
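The filtering and tab-delimited export step can be illustrated with a short sketch. It is written in Python for brevity rather than in the Java used by PRIDE Inspector, and it is not the Inspector's own code: the function names, the column headers, the example ratios and the rule that a ratio of 1.0 means "no change" are all assumptions made for the illustration.

```python
import csv
import io


def filter_by_regulation(ratios, min_percent_change):
    """Keep proteins whose ratio to the control differs by at least the given percentage."""
    kept = {}
    for accession, ratio in ratios.items():
        # Assumption: ratio 1.0 means no change relative to the control sample.
        percent_change = (ratio - 1.0) * 100.0
        if abs(percent_change) >= min_percent_change:
            kept[accession] = ratio
    return kept


def export_tab_delimited(ratios):
    """Serialise the filtered ratios as tab-delimited text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["accession", "ratio_vs_control"])
    for accession in sorted(ratios):
        writer.writerow([accession, ratios[accession]])
    return buffer.getvalue()


# Illustrative values only, not taken from any PRIDE experiment.
example_ratios = {"P04637": 2.0, "P38398": 0.4, "P00533": 1.1}
selected = filter_by_regulation(example_ratios, min_percent_change=50)
print(export_tab_delimited(selected))
```

Setting the threshold to zero keeps the full list, which corresponds to choosing not to filter.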
As a first step, the Reactome pathway analysis will be triggered. The given identifier set will be mapped to Reactome, and a list of all Reactome pathways will be provided, indicating the coverage of each pathway with the current identifier set (See file 'Quantitative Proteomics - S&T Figures' Figure 3). Next, the user can select a pathway of interest, and open the Reactome pathway browser (See file 'Quantitative Proteomics - S&T Figures' Figure 4). In this view, the expression values from PRIDE will be dynamically mapped to a colour-encoded heat map. Each protein entity in the Reactome pathway is then colour-coded according to the expression value.
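The coverage figure reported for each pathway can be read as the fraction of pathway members found in the submitted identifier set. The sketch below assumes that definition; the pathway names and identifiers are hypothetical, and the real analysis runs on the Reactome server rather than in client code like this.

```python
def pathway_coverage(identifiers, pathways):
    """Return (pathway, hits, size, fraction covered) rows, best coverage first."""
    submitted = set(identifiers)
    rows = []
    for name, members in pathways.items():
        members = set(members)
        if not members:
            continue  # an empty pathway has no meaningful coverage
        hits = submitted & members
        rows.append((name, len(hits), len(members), len(hits) / len(members)))
    rows.sort(key=lambda row: row[3], reverse=True)
    return rows


# Hypothetical pathway membership, for illustration only.
toy_pathways = {
    "hypothetical pathway A": ["P1", "P2", "P3", "P4"],
    "hypothetical pathway B": ["P2", "P5", "P6", "P7"],
}

for name, hits, size, fraction in pathway_coverage(["P1", "P2", "P9"], toy_pathways):
    print(f"{name}: {hits}/{size} proteins ({fraction:.0%})")
```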
As a key feature, this view then allows the user to "cycle" through multiple series of expression values, for example different cell cycle stages, or a comparison of healthy and diseased tissue.
Implementation: PRIDE Inspector is implemented in Java 1.6. It is normally used as a Java Webstart application. This posed a specific problem, as the Reactome pathway analysis view is implemented as a web application, originally accessible only as a GET request. As PRIDE datasets can be large, we have changed the protocol to POST requests, allowing transmission of large datasets. A Java application cannot directly send a POST request and trigger an external web browser to display it, so we had to create a workaround. The Inspector writes a temporary HTML page, then triggers the user's web browser to open this page. Once the HTML page is loaded, it will automatically send a query to the Reactome server, and then visualise the results.
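The workaround can be sketched as follows. PRIDE Inspector itself is written in Java; Python is used here only to keep the example short. The target URL and the form field name are placeholders, not the real Reactome endpoint or its parameters.

```python
import html
import pathlib
import tempfile
import webbrowser


def build_autopost_page(action_url, fields):
    """Build an HTML page whose only job is to POST the given fields on load."""
    inputs = "\n".join(
        f'<input type="hidden" name="{html.escape(name, quote=True)}" '
        f'value="{html.escape(value, quote=True)}">'
        for name, value in fields.items()
    )
    return (
        "<!DOCTYPE html>\n"
        '<html><body onload="document.forms[0].submit()">\n'
        f'<form method="post" action="{html.escape(action_url, quote=True)}">\n'
        f"{inputs}\n"
        "</form>\n</body></html>\n"
    )


def open_in_browser(page):
    """Write the page to a temporary file and ask the default browser to open it."""
    with tempfile.NamedTemporaryFile(
        "w", suffix=".html", delete=False, encoding="utf-8"
    ) as handle:
        handle.write(page)
        path = handle.name
    webbrowser.open(pathlib.Path(path).as_uri())
    return path


# Placeholder endpoint and field name, for illustration only.
page = build_autopost_page(
    "https://example.org/pathway-analysis",
    {"identifiers": "P04637 P38398 P00533"},
)
print(page)
```

Calling open_in_browser(page) would hand the page to the user's browser, which then submits the form as a POST request; the browser, not the desktop application, ends up displaying the server's response.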
Summary: In WP8, we have extended the existing PRIDE database to appropriately represent quantitative protein expression data, and to provide both tabular download and external analysis capabilities for such data. With the connection to the external Ensembl and Reactome tools, we have taken the first steps towards developing PRIDE from a mass spectrometry specialist resource into a much wider protein expression resource for molecular biology.
DYNAMIC INTERACTION NETWORKS
The objectives of this work package were the adaptation of the IntAct molecular interaction database to new data types, and the enhanced integration of IntAct into a systems biology context. These objectives have been implemented in three tasks:
Task 1: Dynamic molecular interaction data
Significant progress has been made in capturing molecular interaction data; however, the dynamics inherent within these networks are often overlooked. Proper cellular functioning requires coordination of a large number of events. Identifying temporal and contextual signals (underlying the interactions) is important to understand cellular function. Interaction dynamics can describe in part these events, for example, how cells respond to environmental cues or how an interaction network changes during development or differentiation. In this task we enhanced IntAct to allow efficient deposition, curation, display, and analysis of dynamic interaction data.
The PSI-MI schema was not designed to describe spatial, temporal and/or contextual variation in interactions. However, the XML schema was always designed to be flexible, and the ability to add ‘attributes’ to each entity was present in the schema from the start. In order to describe dynamic variables, we have utilized the experiment and interaction "attribute list" entities rather than making changes to the schema itself. We propose new attribute terms to describe dynamic variables, making use of the attribute list in the "experimentDescription" and "interaction" entities. The use of the experimental attribute controlled vocabulary terms ‘variable’ and ‘variable_2’ allows us to describe variables in 2 dimensions, for example a time course following the addition of an agonist at two different concentrations, and this model is scalable to ‘n’ dimensions. Similarly, the corresponding terms ‘variable_condition’ and ‘variable_condition2’ as interaction attributes allow us to fully describe the experimental conditions under which each separate set of interactions has been observed; again this annotation is scalable. This usage of the schema is currently limited to the IntAct database, but it is intended that it will be presented to the Molecular Interaction workgroup of the HUPO-PSI at the April 2013 meeting (in Liverpool) for adoption as an accepted extended use of the schema, with the attribute terms then being added to the PSI-MI controlled vocabulary.
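As a rough illustration of this attribute-based approach, the fragment below builds the two attribute lists with Python's standard XML library. It is a sketch only: the elements shown omit the other children that a complete PSI-MI XML record requires, and the attribute values (variable descriptions and condition values) are assumptions about how such content might be written, not IntAct's actual conventions.

```python
import xml.etree.ElementTree as ET


def add_attributes(parent, attributes):
    """Append an attributeList with one attribute element per (name, value) pair."""
    attribute_list = ET.SubElement(parent, "attributeList")
    for name, value in attributes:
        node = ET.SubElement(attribute_list, "attribute", {"name": name})
        node.text = value
    return attribute_list


# Experiment level: declare the two variables (illustrative wording).
experiment = ET.Element("experimentDescription", {"id": "1"})
add_attributes(experiment, [
    ("variable", "time after agonist addition (minutes)"),
    ("variable_2", "agonist concentration (nM)"),
])

# Interaction level: record the condition under which this interaction was seen.
interaction = ET.Element("interaction", {"id": "2"})
add_attributes(interaction, [
    ("variable_condition", "30"),
    ("variable_condition2", "100"),
])

print(ET.tostring(experiment, encoding="unicode"))
print(ET.tostring(interaction, encoding="unicode"))
```

Because only attribute content is added, a consumer that does not know about the new terms can still read the record as ordinary PSI-MI XML.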
Following this concept we have updated our curation and data management tools in IntAct. With these modifications, curators are able to define dynamic data in the IntAct database through the editor, the data are captured in the PSI-MI XML format and users can visualize them through the Detailed View of the IntAct interface (See file 'Dynamic Interaction Networks - S&T Figures' Figures 1 & 2).
By default, the dynamic interaction network in the IntAct interface displays all the interactions from one experiment.
In the example shown, the authors have looked at changes in the interactions between a small group of host proteins over a time course following viral infection. Each condition for the "Time after Sendai viral infection (hours)" variable highlights a set of interactions. Condition "2" highlights interactions of “stat6_human” with “tbk1_human”, “mavs_human” and “tm173_human” (See file 'Dynamic Interaction Networks - S&T Figures' Figure 1). Condition "6" highlights an interaction between “stat6_human” and “tm173_human” (See file 'Dynamic Interaction Networks - S&T Figures' Figure 2). Condition "12" highlights interactions of “stat6_human” with “tbk1_human”, “mavs_human” and “tm173_human”.
Task 2: Third party data integration
We have developed the PSI Common Query Interface (PSICQUIC), a standard interface to query multiple interaction data resources with the same query. Query results are then integrated on the client side. PSICQUIC has been well received in the community, and implemented by 25 resources (August 2012) providing more than 150 million interactions.
The original plan for the implementation of this task was to import and reformat data from third parties into an “IntAct-light” section, visually separated from IntAct core data due to different curation policies, but accessible through the same web interface. However, this policy would have raised issues with data from sources which allow academic use, but do not allow commercial use or redistribution, like for example HPRD. In addition, such a “composite” resource would put considerable strain on the IntAct team due to the need for constant re-imports and adaptation of import policies whenever external databases change their output formats, file locations, etc.
To circumvent these challenges, we have followed another policy, namely that of database federation rather than data warehousing. In the context of the HUPO Proteomics Standards Initiative (PSI), we have developed the PSI Common Query Interface, a standard interface for querying of molecular interaction data. File 'Dynamic Interaction Networks - S&T Figures' Figure 3 illustrates the PSICQUIC concept. Queries can be formulated in a comprehensive query language, based on Lucene indexing, and the server response is always presented in the standard PSI MITAB format. We have developed an open source PSICQUIC server, allowing efficient deployment of a PSICQUIC service. We also provide an open source client, querying all registered services in turn, and presenting the results in a simple web interface.
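Client-side integration of the responses can be sketched as below. The example assumes MITAB 2.5 rows (15 tab-separated columns, with "-" for empty fields) and merges rows from several services by unordered interactor pair. The service names and the merge rule are assumptions for illustration; this is not the code of the PSICQUIC client.

```python
MITAB25_COLUMNS = 15  # MITAB 2.5 rows have 15 tab-separated columns


def parse_mitab_line(line):
    """Extract a few columns from one MITAB 2.5 row."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < MITAB25_COLUMNS:
        raise ValueError(f"expected {MITAB25_COLUMNS} columns, got {len(fields)}")
    return {
        "id_a": fields[0],         # unique identifier of interactor A
        "id_b": fields[1],         # unique identifier of interactor B
        "method": fields[6],       # interaction detection method
        "publication": fields[8],  # publication identifier
        "source": fields[12],      # source database
    }


def merge_responses(responses):
    """Map each unordered interactor pair to the set of services reporting it."""
    merged = {}
    for service, text in responses.items():
        for line in text.splitlines():
            if not line.strip() or line.startswith("#"):
                continue  # skip blank lines and header comments
            row = parse_mitab_line(line)
            pair = frozenset((row["id_a"], row["id_b"]))
            merged.setdefault(pair, set()).add(service)
    return merged


def example_row(id_a, id_b):
    """Build a minimal MITAB row with every other column left empty."""
    fields = ["-"] * MITAB25_COLUMNS
    fields[0], fields[1] = id_a, id_b
    return "\t".join(fields)


# Hypothetical responses from two services, for illustration only.
responses = {
    "service-one": example_row("uniprotkb:P04637", "uniprotkb:Q00987"),
    "service-two": "\n".join([
        example_row("uniprotkb:Q00987", "uniprotkb:P04637"),
        example_row("uniprotkb:P04637", "uniprotkb:P38398"),
    ]),
}

for pair, services in merge_responses(responses).items():
    print(sorted(pair), sorted(services))
```

Because every server answers in the same tabular format, the client needs one parser regardless of how many services are registered.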
The PSICQUIC system has been enthusiastically received by the molecular interaction data providers: as of August 2012, all major public interaction data resources implement the PSICQUIC interface.
Based on the comprehensive PSICQUIC software tools and the wide implementation of the PSICQUIC interface, we are now providing users with a comprehensive query response, from IntAct and all other registered PSICQUIC servers. File 'Dynamic Interaction Networks - S&T Figures' Figure 4a-4c shows a typical IntAct query, and the response from IntAct and PSICQUIC servers.
The use of PSICQUIC has allowed us to implement integration of third-party data into IntAct in a lightweight and flexible manner, minimizing maintenance overhead and duplication of data. The integration of third-party data via PSICQUIC also allows us to circumvent issues of data ownership. If a data source places restrictions on data availability, it can implement these restrictions in its PSICQUIC server, for example through notes on license requirements in the PSICQUIC registry, or through access restrictions based on IP addresses.
Task 3: PRIDE and Reactome integration
One of the key methods to generate molecular interaction data is Tandem Affinity Purification (TAP). Essentially, a known bait protein is isolated from a cellular context together with its interaction partners (prey proteins), which are identified by mass spectrometry. The detailed mass spectrometry data identifying a prey protein can be highly relevant for the interpretation of the interaction, as they help to assess how reliable the identification of the particular prey protein is. TAP-related MS-based methodologies are currently growing in relevance, as they allow the generation of quantitative information in addition to qualitative information.
At the beginning of the SLING grant, the connection between molecular interaction data and supporting mass spectrometry data was difficult to make, because they were located in two unconnected databases: IntAct for molecular interactions, and PRIDE for mass spectrometry. The aim of this work was to allow a better connection between the two types of data, and to access supporting mass spectrometry evidence in PRIDE directly from the interaction data in IntAct.
We decided to keep interaction data and supporting mass spectrometry evidence separate, and link them through an existing standard format, the Distributed Annotation System (DAS). DAS is a REST-based web service protocol to represent the location of functional regions (features) on nucleotide or protein sequences. DAS has been used extensively for example in the EU-funded BioSapiens project. Previously, the DASTY DAS browser has been integrated into the IntAct molecular interaction database to allow the visualisation of known binding sites (from IntAct) on protein sequences (from UniProt) in the context of other relevant data, for example annotated functional domains (from InterPro).
We have implemented DAS tracks for PRIDE data which facilitate the visualisation of:
1. all peptides in PRIDE for a given protein sequence
2. all peptides in PRIDE for a given protein sequence in a given PRIDE experiment
File 'Dynamic Interaction Networks - S&T Figures' Figure 5 provides an example DASTY view for UniProt AC P12931. The grey track “consensus region” summarises all identified peptides from PRIDE for this protein, irrespective of experimental context. Each of the green “polypeptide” tracks below that shows identified peptides for the protein from one particular PRIDE experiment. Peptide coverage of the protein is thus visible at a glance, in a user-friendly manner.
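The idea behind a summarising track such as the “consensus region” can be sketched as an interval merge. This is a minimal illustration, assuming peptides are given as 1-based inclusive start and end positions on the protein sequence; it is not the PRIDE DAS implementation, and the function names are hypothetical.

```python
from typing import List, Tuple

Interval = Tuple[int, int]  # 1-based, inclusive start and end positions


def merge_peptides(peptides: List[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent peptides into covered regions."""
    merged: List[Interval] = []
    for start, end in sorted(peptides):
        if merged and start <= merged[-1][1] + 1:
            # The peptide overlaps or touches the previous region: extend it.
            prev_start, prev_end = merged[-1]
            merged[-1] = (prev_start, max(prev_end, end))
        else:
            merged.append((start, end))
    return merged


def coverage_fraction(peptides: List[Interval], protein_length: int) -> float:
    """Fraction of residues covered by at least one peptide."""
    covered = sum(end - start + 1 for start, end in merge_peptides(peptides))
    return covered / protein_length
```

Calling `merge_peptides` on the peptides of all experiments gives the summary regions, while calling it per experiment corresponds to the individual tracks.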
While not directly a deliverable, we are using the same infrastructure to export post-translational protein modification (PTM) data in a separate DAS track. PTMs are a potential regulator of protein interactions, and therefore inclusion of these in DAS tracks accessible directly within IntAct provides a novel means for hypothesis formation, for example when annotated binding regions from IntAct and PTMs from PRIDE overlap.
To implement the second aim of this task (the integration of IntAct molecular interaction data and Reactome pathway data), the PSICQUIC protocol has been re-used. For any protein entity annotated in the pathway, known interactions from IntAct or another PSICQUIC resource can be displayed, and switching from one molecular interaction data source to another is as easy as selecting a different one in a menu (see file 'Dynamic Interaction Networks - S&T Figures' Figure 6).
ENHANCING PROTEIN ANNOTATION STANDARDS
The major objectives of this work package were the development of enhanced annotation standards for the annotation of protein interactions in UniProtKB, and the development and application of protein naming standards by UniProt and their adoption by other resources.
Annotation of protein interactions in UniProtKB: A global map of high-quality curated protein-protein interactions in cellular systems is an essential prerequisite for the generation and validation of functional hypotheses regarding individual proteins and the systems in which they act, both in normal physiological contexts and disease states.
Prior to the initiation of this work, information on protein interactions was routinely curated in UniProtKB in the form of free-text natural language descriptions, which provide a rich source of interactions and associated contextual information for biologists. To enhance the utility of these annotations, and to facilitate their integration with information from other resources, we have extended the UniProt curation workflow to include the curation of binary protein interactions to the minimal information standard MIMIx (‘Minimal Information about a Molecular Interaction experiment’). MIMIx provides a concise and unambiguous description of the interacting molecules as well as the interaction type and the experimental method used to demonstrate the interaction, using standard identifiers and ontologies developed by the Molecular Interaction workgroup of the Human Proteome Organization Proteomics Standards Initiative (HUPO PSI-MI). MIMIx is an accepted standard of the IMEx (International Molecular Exchange) consortium, which includes many of the major repositories of curated interaction data such as IntAct, BioGRID, DIP, and MINT. MIMIx curation is performed using a common tool that was co-developed and is shared by both UniProt and IntAct. During the course of SLING, a total of ten trained curators have used this tool to curate 2,815 MIMIx-level binary interactions from 1,407 experiments described in 409 publications. These MIMIx-level annotations are stored in the IntAct database and made available through the IntAct website (http://www.ebi.ac.uk/intact/main.xhtml) and are also exported back to UniProtKB (http://www.uniprot.org/) along with data curated by IntAct, for display in UniProtKB records.
To display this information we have developed an extension of the UniProtKB format which allows relevant information from MIMIx-level annotations to be stored and displayed. We are also in the process of mapping the PSI-MI ontology terms used in MIMIx to the UniProt evidence code ontology, in order to allow this information to be displayed in a manner consistent with that of other annotations from UniProt, using a common evidence attribution namespace.
Member databases of the IMEx consortium also curate interaction data to a higher level of experimental detail, IMEx-level, which may include information such as the precise constructs and experimental system used. Many of the MIMIx interactions produced by Swiss-Prot can be upgraded to satisfy IMEx requirements with minimal effort, and the Swiss-Prot group is now one of the largest contributors of binary protein interactions curated to IMEx-level (see http://www.ebi.ac.uk/intact/imex/main.xhtml?query=).
The availability of data curated to common standards, using shared ontologies and namespaces, facilitates the initial steps in the generation of high-quality protein interaction maps for the study and simulation of biological systems. By integrating MIMIx-level curation into the UniProt workflow we have increased the provision of structured, accessible, and high-quality interaction data that are fully compatible with those from IMEx member databases and other resources. In order to participate more fully in the continuing development of common protocols and ontologies for the curation of protein interactions, the Swiss-Prot group has joined the IMEx consortium as a representative of the UniProt consortium (see http://www.imexconsortium.org/).
This work provides effective synergies with developments from other EU-funded framework projects; IMEx is supported by the European Commission under PSIMEx, contract number FP7-HEALTH-2007-223411, while the ontology and standards development work of the HUPO PSI-MI working group was supported by the European Commission under FELICS, contract number 021902 (RII3), within the Research Infrastructure Action of the FP6 'Structuring the European Research Area' Programme.
Naming standards for proteins: Establishing protein naming standards has been a focal point of many curation efforts. The protein name is a key unit of information exchange, and the use of standardized, up-to-date, and well formatted protein names in genome annotation pipelines and sequence databases aids functional comparisons and other common data integration tasks.
UniProt has previously developed a set of publicly available protein naming guidelines (http://www.uniprot.org/docs/nameprot) that are applied during the manual curation of UniProtKB/Swiss-Prot records and the automatic annotation of UniProtKB/TrEMBL records. Other major data providers such as the National Center for Biotechnology Information (NCBI) and sequencing and genome annotation centers such as the US Department of Energy Joint Genome Institute (DOE-JGI) have also developed (and apply) their own guidelines. These efforts were not coordinated, which resulted in a plethora of protein name variations and inconsistencies between resources, hampering efforts to integrate and compare the functional annotations they provide.
In this SLING work, we have refined our original protein naming guidelines following consultations with representatives from a number of external resources, including the participants at the 2010 Genome Annotation Workshop of the National Center for Biotechnology Information (NCBI) in Washington DC. The resulting updated guidelines (http://www.uniprot.org/docs/gennameprot and http://www.uniprot.org/docs/proknameprot) were subsequently adopted by a number of data providers, including the NCBI and the International Nucleotide Sequence Database Collaboration (INSDC), and are now used as a basis for NCBI RefSeq annotations and as a recommended standard for INSDC genome submissions. Our guidelines were also accepted as a basis for common enzyme nomenclature by the Joint Commission on Biochemical Nomenclature (JCBN) of the International Union of Biochemistry and Molecular Biology (IUBMB) and the International Union of Pure and Applied Chemistry (IUPAC). The JCBN includes representatives of major providers of data on enzymes including KEGG, MetaCyc, and BRENDA, as well as UniProt, which will ensure further standardization between these resources.
The application of a single set of common nomenclature guidelines by these resources will greatly facilitate the identification and retrieval of common sequences from them, as well as the rapid adoption of new nomenclature proposals by data producers and curators and external nomenclature committees.
In parallel to these efforts we have continued to apply our shared nomenclature guidelines in the annotation of UniProtKB. Annotation is an ongoing process and involves the manual curation of new UniProtKB/Swiss-Prot records for experimentally characterized proteins, the manual update of existing UniProtKB/Swiss-Prot records in response to new recommendations and information, and the automatic propagation of annotation from UniProtKB/Swiss-Prot to unreviewed records of UniProtKB/TrEMBL by automatic annotation pipelines such as HAMAP. The HAMAP pipeline currently provides extensive functional annotation, including recommended protein names, for more than 1,700 protein families in UniProtKB, corresponding to over 1.6 million UniProtKB/TrEMBL records. Independent analyses have shown that the average accuracy of HAMAP annotations, including protein names, significantly exceeds that of the annotations provided within the original archival INSDC submissions. We are now using the HAMAP pipeline as a means to achieve further harmonization of existing nomenclature in accord with common guidelines. This involves the comparison of existing protein nomenclature from HAMAP with that provided by other resources such as the ProtClustDB and COG resources of the NCBI, as well as the KEGG orthology groups (KO groups) and TIGRFAMs. Nomenclature comparison and standardization for a core set of highly-conserved protein families will be extended to other protein families later, and will result in significant improvements to the annotations provided by these resources. New recommendations arising from these comparisons will be reflected in continuing updates to the shared protein naming guidelines provided by UniProt.
In summary, we have developed a set of protein-naming guidelines that have been agreed upon and adopted by a number of the major sequencing centres, data providers, and nomenclature committees. The on-going resolution of existing discrepancies in protein nomenclature between resources, driven by the comparison and harmonization of the automatic annotation pipelines that provide the bulk of protein nomenclature to the ever-growing body of uncharacterized sequences, will facilitate further data integration and interpretation by users. Protein nomenclature information (and other UniProtKB annotations) is provided in several formats including text, XML and RDF/XML (http://www.uniprot.org/uniprot/?query=reviewed:yes&format=*).
ENHANCING ENZYME AND ENZYME-LIGAND DATA
In the SLING project, the BRENDA enzyme database has been substantially enhanced for applications in the area of systems biology and medicine. The emphasis was on (A) completing the manually annotated data with full sets of enzyme data obtained by text-mining methods; (B) providing automated access to the manually annotated data; and (C) enlarging the fields covered by BRENDA and including new enzymes.
1. Enzyme-kinetic data (such as the turnover number and the Michaelis-Menten constant KM) are highly valuable for researchers in the fields of metabolomics and systems biology. Even though BRENDA contains large amounts of kinetic data which have been manually extracted from the literature in the past 25 years, it is impossible to annotate the complete enzyme literature. Hence, text-mining methods have been developed which extract kinetic enzyme data from literature abstracts and supplement the manually annotated data content. The extracted data are stored in a new section of BRENDA designated as KENDA (Kinetic ENzyme DAta) and can be accessed from the BRENDA website. In addition to providing lists of search results, the data can be viewed in their textual context, highlighted with respect to kinetic type, enzyme name, ligands, or source organism. The relation between enzyme (mal)function and diseases is an important issue in medical and pharmaceutical research but had not been covered by BRENDA before. To provide this information, a new text-mining method was developed which is based on the co-occurrence of enzyme names and disease terms in literature abstracts. The enzyme-disease relations are classified into four categories using machine-learning methods, which gives the data a clear arrangement. (Task 1 of WP11)
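The co-occurrence principle mentioned above can be sketched as a simple count of enzyme-disease pairs found in the same abstract. This is a toy illustration only: it assumes plain case-insensitive substring matching against small vocabularies, whereas the actual BRENDA method and its machine-learning classification step are not described in this report. All function names are hypothetical.

```python
from collections import Counter
from itertools import product
from typing import Iterable, List


def find_terms(text: str, vocabulary: Iterable[str]) -> List[str]:
    """Return the vocabulary terms that occur in the text (case-insensitive)."""
    lowered = text.lower()
    return sorted(term for term in vocabulary if term.lower() in lowered)


def cooccurrence_counts(
    abstracts: Iterable[str],
    enzyme_names: Iterable[str],
    disease_terms: Iterable[str],
) -> Counter:
    """Count, per (enzyme, disease) pair, the abstracts mentioning both."""
    enzyme_names = list(enzyme_names)
    disease_terms = list(disease_terms)
    counts: Counter = Counter()
    for abstract in abstracts:
        enzymes = find_terms(abstract, enzyme_names)
        diseases = find_terms(abstract, disease_terms)
        # Every enzyme found is paired with every disease found in the abstract.
        counts.update(product(enzymes, diseases))
    return counts
```

A raw count like this only flags candidate relations; deciding what kind of relation an abstract expresses is the part that requires a trained classifier.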
2. Before the start of the SLING project, the kinetic data in BRENDA could only be accessed via the webforms or via a SOAP interface. The former is restricted to single types of kinetic constants (e.g. only KM values but not the corresponding kcat values) and the latter requires at least basic programming knowledge. The new SBML output functionality allows the automatic generation of a single SBML file containing the kinetic data of enzyme-catalyzed reactions of an organism. Thus, several hundred reactions and their specific data are combined in this way. In order to complete the list of kinetic data, the new data field IC50 was incorporated and populated by manual literature annotation. These data indicate the concentration of an inhibitor which results in a 50% inhibition of an enzyme. (Task 2 of WP11)
3. The BRENDA tissue ontology (available in OBO format) has become a standard in the field, and is used by many researchers beyond the enzyme-related field. The ontology was expanded: branches and nodes have been added to the tree, and new terms, including their definitions, were compiled. In total, 1,345 new single terms and 1,179 new definitions were added. The number of new terms is significantly higher than expected, due to a large number of new cell types and cell lines. (Task 3 of WP11)
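For readers who want to work with an ontology distributed in OBO format, the term stanzas can be read with a few lines of code. The following is a minimal sketch, assuming the common OBO flat-file layout of `[Term]` stanzas with `tag: value` lines; it ignores escaping rules and other stanza types, and the function name is hypothetical.

```python
from typing import Dict, List, Optional


def parse_obo_terms(text: str) -> List[Dict[str, List[str]]]:
    """Collect the tag-value pairs of every [Term] stanza in an OBO file."""
    terms: List[Dict[str, List[str]]] = []
    current: Optional[Dict[str, List[str]]] = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if line.startswith("["):
            # A new stanza begins; only [Term] stanzas are kept.
            current = {} if line == "[Term]" else None
            if current is not None:
                terms.append(current)
        elif current is not None and ": " in line and not line.startswith("!"):
            tag, value = line.split(": ", 1)
            if tag == "is_a":
                # Drop the trailing "! parent name" comment after the identifier.
                value = value.split("!", 1)[0].strip()
            current.setdefault(tag, []).append(value)
    return terms
```

Each term is returned as a dictionary of lists because tags such as `is_a` and `synonym` may occur several times within one stanza.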
4. BRENDA also gives access to the enzyme 3D structures from the Protein Data Bank and to the enzyme-specific protein sequences stored in the UniProt database. With the increasing number of available 3D structures for enzymes, it has become necessary to provide methods for displaying the 3D structure-based information of active centres, sites for post-translational modification, targeting and pro-sequences, anchoring, etc. in the enzyme's 3D structure. The software for this new option was developed in the context of SLING, and utilises the protein features annotated by the UniProt project at the EBI and the SIB. A user-friendly web interface allows fast and straightforward access to the relevant enzyme-specific data in context with their structures. The user can, for example, search for certain EC numbers, enzymes of a selected organism, a specific PDB ID or UniProt accession number. Structure-function relationships can be explored, e.g. to directly understand the impact that an amino acid exchange (also stored in BRENDA) or other mutations can have. Essential amino acids within active centres and spatially adjacent (groups of) amino acids can be identified in this manner. In addition, interesting target points for enzyme engineering may be found by a visual inspection of the relevant regions of the proteins. Furthermore, by integrating different PDB structures in BRENDA that represent different fragments of the protein, all available structures of an enzyme can be visualized. Eighteen different types of domains and sites are available. (Task 4 of WP11)
5. The naming and classification of enzymes is essential for unambiguous storage of enzyme data in databases, and for providing links between the different types of enzyme data, e.g. linking sequence data to enzyme property data or for the presentation in metabolic networks. EC numbers are widely used in the literature and in biological databases and represent the standard reference system for enzymes. In the course of the project, 336 new EC numbers were submitted by the scientists of BRENDA and approved by the IUBMB. This number is substantially higher than anticipated (i.e. 40 new EC numbers) because the literature search and annotation process could be accelerated. For this purpose, new software has been developed which supports several steps of the process. (Task 5 of WP11)
CHEMICAL ENTITIES OF BIOLOGICAL INTEREST
ChEBI (Chemical Entities of Biological Interest) is the largest fully-hand-annotated, freely-available online database of molecular entities, comprising a chemical dictionary and an ontology focused on 'small' chemical compounds, in particular natural products and synthetic compounds used to intervene in the processes of living organisms. ChEBI now contains over 29,000 fully annotated entities. The main objectives of Work Package 12 were as follows.
• To increase accessibility of ChEBI by both the biological and chemical community by removing proprietary software in the ChEBI infrastructure
• To strengthen growth and sustainability by improving the curation toolkit towards automatic extraction of information from the literature
The work was broken down into four main tasks, described in more detail below.
Task 1: User Survey
At the outset of the SLING project, we conducted surveys among ChEBI users (both through online questionnaires and using personal interviews with selected candidates) to help us to understand the needs of our multiple user communities (Biologists, Chemists, Ontologists, etc.), and so custom-tailor further ChEBI development (D12.01). A report analysing the results and including action points for future directions of ChEBI was produced (D12.02).
Task 2: Enhancement of Curation Tool
A text-mining feature has been added to the ChEBI curation tool which links to CiteXplore’s text-mining infrastructure in order to mine the text of a given citation. This semi-automatically extracts relevant data from the printed literature, including patents. Suggested citations and roles related to a given chemical entity are added to a ChEBI dataset. Names, synonyms and biological/chemical roles are highlighted in the title or abstract. It is possible to filter by title, patent or by title and abstract as well as sort by text-mining score or date of publication. Inclusion of the suggested data in the final ChEBI entry is controlled by the curator.
The new feature has resulted in an increase in the number of relevant citations included in ChEBI entries, and has enabled addition of extra biological and chemical role data to the ChEBI ontology. These types of data can often be difficult and time-consuming to obtain. (D12.03)
Task 3: Removal of Proprietary Dependencies
At the start of the SLING project, ChEBI relied on a number of proprietary chemoinformatics modules which prevented the free dissemination of ChEBI's technology to the scientific community. This included the software used for chemical structure searching as well as that for the display and editing of 2D chemical structure diagrams. Our primary aim in this task was to re-engineer the ChEBI software system to implement open source alternatives to the proprietary libraries in use, so that our source code could become accessible and redistributable without licensing restrictions.
For the chemical structure editor in the public interface, ChEBI originally used Marvin (ChemAxon), a proprietary collection of tools for drawing, displaying and characterising chemical structures and structure searches. ChemAxon also provides a software library for programmatic manipulation of chemical structures, and this was widely used throughout the ChEBI internal software suite. The initial implementation of structure searching in ChEBI used Marvin as a library for working with the chemical structures programmatically and the Chemistry Development Kit for the actual search functionality.
We have implemented the open source Oracle structure search cartridge OrChem as the ChEBI structure search engine and JChemPaint as the structure editor. Both of these technologies are based on the open source chemical library Chemistry Development Kit (D12.04).
As a result of developments made to JChemPaint as part of this work, the current structure display in ChEBI (which uses Marvin) could in principle be quickly and easily replaced by non-proprietary software. Following consultation with users, however, it has become apparent that to do this currently would result in serious inconvenience to many users due to the loss of a variety of additional tools currently available as part of the Marvin tool suite. These tools include, for example, the ability to calculate and display the various ionisation states of a molecule over a broad pH range, as well as the ability to use 2-D structural data to calculate energy-minimised 3-D conformations of structures and to display the 3-D conformations from any chosen viewpoint using a variety of space-filling and ball-and-stick depictions. Although these tools were originally considered a ‘bonus’ for users of ChEBI and not regarded as essential, it has become clear that they are widely employed by ChEBI users. We have therefore decided to delay replacing the Marvin structure display until we are also able to offer non-proprietary alternatives to those tools that would be particularly missed by users if the complete Marvin suite were to be removed.
Task 4: User Workshop
A workshop was held on 13th March 2012 as part of the EMBL-EBI Industry Programme Workshop “Chemogenomics, cheminformatics and metabolomics workshop: update, strategy and workflows” at the EBI, Hinxton campus, UK. There were approximately 30 attendees from a variety of UK and overseas organisations, drawing from both industry and academia. Attendees were given an opportunity to discuss the achievements of the previous 36 months, as well as to put forward their own ideas concerning the future direction of the project (D12.05). It was noted that as ChEBI has developed, so the number of users has increased. Thus excluding traffic generated by robots, the number of unique visitors per month to the ChEBI website has increased by an average of 17% per year, from 15,150 in 2009 to 24,100 in 2012. Use of ChEBI Web Services (used for programmatic access to the ChEBI data) has increased even more dramatically, from an average of 418,000 hits per month in 2009 to 3,365,000 hits per month in 2012.
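The growth figures quoted above can be checked with a short calculation. This is a sketch that assumes compound annual growth over the three year-on-year steps from 2009 to 2012; the report does not state how its average was computed, and the function name is hypothetical.

```python
def annual_growth_rate(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1.0 / years) - 1.0


# Figures quoted in the report for 2009 and 2012.
visitor_growth = annual_growth_rate(15_150, 24_100, years=3)
web_service_fold = 3_365_000 / 418_000

print(f"unique visitors: {visitor_growth:.1%} per year")
print(f"web service hits: {web_service_fold:.1f}-fold increase")
```

Under this assumption the visitor figures are consistent with the stated average of about 17% per year, while Web Services traffic grew roughly eightfold over the same period.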
Rijnbeek M. and Steinbeck C. (2009) OrChem - An open source chemistry search engine for Oracle. J. Cheminf., 1: 17.
Krause S., Willighagen E. and Steinbeck C. (2000) JChemPaint - Using the Collaborative Forces of the Internet to Develop a Free Editor for 2D Chemical Structures. Molecules, 5: 93-98.
Steinbeck C., Han Y.Q., Kuhn S., Horlacher O., Luttmann E. et al. (2003) The Chemistry Development Kit (CDK): An open-source Java library for chemo- and bioinformatics. J. Chem. Inf. Comput. Sci., 43: 493-500.
ENHANCING PATENT DATA ACQUISITION
The European Patent Office (EPO, http://www.epo.org) offers inventors a uniform application procedure which enables them to seek patent protection in up to 40 European countries. Supervised by the Administrative Council, the Office is the executive arm of the European Patent Organisation. The main task of the European Patent Office is to examine patent applications, and to grant European patents. Besides the granting procedure, disclosure of the invention and subsequent publication are fundamental to the European patent system.
A number of patents disclose nucleic acid and protein sequences. These biological sequences are unusual within patents in that they contain information presented in a structured manner, compared to the other unstructured parts of a patent application. In work initiated under the EU FELICS FP6 programme and continued under the FP7 SLING project, the EPO has pursued developments to improve both the quality and the quantity of available sequence information. In collaboration with the EBI, progress has been made toward the extraction of chemical information and cross-referencing to the scientific literature. These projects are part of WP03 and WP13.
WP13 has the following three objectives:
i. to supply online expert sequence submission software that enables the submission of well-annotated biological sequences
ii. to develop a software suite that extracts chemical information from patent text and images to populate public databases
iii. to develop detection algorithms for references (citations) to literature in patent documents and to populate public databases with the results
Task 1: Online sequence submission software
Biological sequences, including annotations, are required to be presented according to a standardized format adopted by most patent offices around the world. To aid inventors and their agents, dedicated software, BiSSAP (Biological Sequence Submission Application for Patents), was released in March 2011 and is the EPO-preferred submission tool for sequence listings.
It was initially foreseen to offer a web-based expert service that would allow the user to benefit from an online sequence validation tool. The online service was abandoned, and the expertise has instead been implemented within BiSSAP as a standalone service. Numerous permanent validation rules have been implemented within the software. These verification rules can also be applied to existing sequence listings that have been created outside BiSSAP.
BiSSAP developments continue beyond the scope of SLING. The BiSSAP 1.2 release was made public on September 3, 2012 (see www.epo.org/bissap).
Task 2: Text mining, data extraction and database population
A platform for chemicals extraction from patent literature has been implemented at the EPO. Patents are collected from Open Patent Services (OPS), and are subject to two parallel extractions: text-mining and optical recognition. The process is as follows:
• Text-mining, using Open Source Chemistry Analysis Routines (OSCAR), retrieves organic chemical names from patent documents and stores them in the ChEPO database. OCR errors in patents and false chemical names are resolved (as often as possible) via chemical name-to-structure conversion with OPSIN.
• Optical chemical structure recognition has been improved with the EPO contribution to OSRA (v1.3.8). The structures are then resolved by a chemical structure similarity search in ChEBI, PubChem and ChemSpider. If the extracted structure exists in one of these databases, the database ID is assigned.
Extracted data have been provided to the EBI and NCBI. NCBI has populated PubChem with a batch of chemicals relevant in the chemical patent literature. For each compound, the patent ID (along with its links to esp@cenet) is provided. Other information, e.g. patent title, sections where the chemical is found (abstract, claims, description) and paragraph number, is also provided. Over 100,000 compounds have been loaded into PubChem from the results of this work.
Task 3: Cross-referencing
Using text-mining, the aim consisted of extracting relevant cross-references to prior-art literature, and establishing hyperlinks to those publications. Extracted information should also populate a database of cross-references by the EBI.
To extract and parse citations disclosed in patent documents the EPO first created a manually-annotated set of patent documents and their citations, a so-called "Gold Standard" consisting of 70 patents with c. 4,500 citations.
This standard has been used to determine the ability and quality of the software to recognise citations in the full-text patents by comparing results found by humans and machine; it can also serve to further train the algorithms. This set was used to assess the performance of several solutions, starting with regular expressions and grammars. This approach turned out to be ineffective. Simple grammars and regular expressions can to a large degree locate the *position* of a citation in the text, i.e. text containing author names, journal titles, page numbers and publication years. However, the resolution of ambiguity between author names and article titles remains a challenge, as does finding the beginning and the end of the citation.
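The strength and the weakness of the regular-expression approach can be illustrated with a small sketch. The pattern below is an assumption made for illustration (a year in parentheses followed, within 200 characters, by "volume: pages"); it is not the pattern used by the EPO, and the function name is hypothetical.

```python
import re
from typing import List, Tuple

# A year in parentheses, then free text, then "volume: page" or "volume: first-last".
CITATION_CORE = re.compile(
    r"\((?P<year>(?:19|20)\d{2})\)"        # publication year, e.g. (2003)
    r"[^()]{0,200}?"                        # title and journal name, unparsed
    r"(?P<volume>\d+)\s*:\s*"               # volume number followed by a colon
    r"(?P<pages>\d+(?:\s*-\s*\d+)?)"        # single page or page range
)


def locate_citations(text: str) -> List[Tuple[int, int, str, str]]:
    """Return (start, end, year, pages) for each candidate citation found."""
    return [
        (m.start(), m.end(), m.group("year"), m.group("pages"))
        for m in CITATION_CORE.finditer(text)
    ]
```

Such a pattern finds roughly where a citation sits, but the match starts at the year, so the author names that precede it fall outside the span, and the title and journal name are not separated from each other. This is the boundary and ambiguity problem described above.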
As a next step, other technologies, such as the parser generator ANTLR (ANother Tool for Language Recognition) and the machine-learning tool GROBID (GeneRatiOn of BIbliographic Data), were evaluated. To assess and test these different approaches, a comparison application was developed to extract the citations and assess the performance of the tools for citation matching; the results varied between 10% and 98%. On the test set, only GROBID scored above 70%. Given the discouraging quality of the results, EPO management decided that substantially more work would be needed to achieve the final goal.
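A comparison of extracted citations against a manually annotated gold standard is commonly scored with precision, recall and F1. The sketch below assumes exact matching after whitespace and case normalisation; the report does not say which metric or matching rule the EPO comparison application used, and the function names are hypothetical.

```python
from typing import Iterable, Set, Tuple


def normalise(citation: str) -> str:
    """Lower-case a citation string and collapse runs of whitespace."""
    return " ".join(citation.lower().split())


def score_extraction(
    gold: Iterable[str], extracted: Iterable[str]
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) of extracted citations against gold."""
    gold_set: Set[str] = {normalise(c) for c in gold}
    found_set: Set[str] = {normalise(c) for c in extracted}
    true_positives = len(gold_set & found_set)
    precision = true_positives / len(found_set) if found_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Exact matching is strict: a citation whose boundaries are off by a single word counts as a miss, which is one reason why scores for different tools can differ widely.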
EXPLOITING ELECTRONIC LITERATURE
The overall goal of SLING (Serving Life-science Information for the Next Generation) was to explore new horizons in data availability, management, and exploitation. The particular contribution of WP14 to this project goal was to investigate the role of emerging full-text resources as an integrative force for life science research. Over the period of SLING activity, there have been significant advances in the availability of full-text resources for reading and re-use.
In order for full-text resources to meet their full potential within the setting of life sciences research data, there is a requirement for online literature repositories alongside other biomedical databases. These repositories need to be programmatically available, and in a structured format (i.e. XML), to allow effective cross-linking between data resources. Furthermore, the growing availability of Open Access documents opens opportunities for text-mining approaches that move towards deeper integration. An infrastructure that supports the integration of knowledge in articles with underlying data resources will help to maximize the return on the international scientific effort and enable discovery.
This work package sought to develop established full-text repositories for biological patents and research articles; furthermore, two tasks were set to demonstrate the added value of full text over abstracts alone for integration between textual information and data resources. The main results of these tasks are as follows:
1. A repository of full-text articles. The Europe PubMed Central website and web services are publicly available for the searching and browsing of 2.3 million full-text research articles. While development and the ongoing support of this work is funded mostly from other sources, SLING specifically contributed to the development of the Open Access article FTP site, designed to support text-mining activities (for example, WP14 Task 4), and towards some of the work required for the release of the public web service, which allows programmatic search of, and access to, all the full text.
2. A repository of full-text patents. This was demonstrated to be possible, and was populated via the European Patent Office Open Patent Services. A repository of 1.3 million full-text patents was established, as well as a basic web service for programmatic access. While the methods developed within this task continue to be used for the retrieval of patent abstracts (also made available via Europe PubMed Central), the full-text repository is not updated at present. However, should the text-mining or scientific community express significant interest in using a full-text patent repository, this work provides an excellent basis for a sustained effort.
3. Extraction of citations from patents. Full-text patents frequently contain citations of journal articles. Europe PubMed Central already supports a citation network based on journal articles; given the availability of full-text patents, the goal of this task was to explore how citations in patents could be added to the journal citation network. This was demonstrated to be possible; however, the variable quality of citation data in patents makes it very difficult to extract citations accurately enough to resolve them effectively to journal article metadata. The degree of manual intervention required makes it prohibitive to scale this task up to production level. However, the experience gained during this exercise could be used to recommend "best practices" for citations to the European Patent Office, should this line of development be of future interest.
4. Extraction of post-translational modification of proteins. Curators of the UniProt database scour the literature for detailed information about proteins with which to annotate records. Given the volume of scientific articles available, developing tools that assist in filtering articles of interest is a valuable approach for maximizing the impact of curation staff efforts. This task resulted in the development of a tool that finds post-translational modifications with an accuracy sufficient to be beneficial as part of a curation pipeline. Furthermore, the tool was applied not only to abstracts (the long-standing target for text-mining) but also to Open Access full-text articles from Europe PubMed Central (as made available in WP14 Task 1). The results derived from full-text articles suggest that they are a valuable source of post-translational modification information, providing significant new information not apparent from mining abstracts only.
To summarise, the tasks within WP14 have resulted in the ongoing provision of full-text article resources that support computational approaches to literature-data integration such as text-mining. The work has informed possible future developments of full-text resources; in the case of journal articles, the value of re-using full text has been demonstrated by the development of a text-mining-based tool that supports curation work.
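The programmatic search mentioned under item 1 can be sketched as follows. This is an illustration only: it assumes the present-day Europe PMC REST search endpoint and its query, format and pageSize parameters, and the OPEN_ACCESS:y filter syntax, none of which are described in this report. The function name build_search_url is hypothetical. The sketch builds the request URL without sending it.

```python
from urllib.parse import urlencode

# Assumed endpoint of the present-day Europe PMC REST search service;
# the report itself does not give a service address.
BASE_URL = "https://www.ebi.ac.uk/europepmc/webservices/rest/search"


def build_search_url(term, open_access_only=True, page_size=25):
    """Build a search URL for a term, optionally limited to Open Access."""
    query = term
    if open_access_only:
        # Assumed field syntax for restricting results to Open Access articles.
        query += " AND OPEN_ACCESS:y"
    params = {"query": query, "format": "json", "pageSize": page_size}
    return BASE_URL + "?" + urlencode(params)


print(build_search_url("phosphorylation"))
```

The resulting URL could then be fetched with any HTTP client; restricting the search to Open Access articles mirrors the subset that the FTP site makes available for text mining.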
SLING’s mission is to enable optimal exploitation of biomolecular and related phenotypic information in all areas of commercial and public life science research. This underpins development in a wide range of key areas of crucial societal importance. The most obvious are in health and human well-being including:
• Medical device research
• Personal care
There are, however, key applications of the resources provided in other life-science-related areas, such as:
• Environmental protection
Through the provision of access to this information, SLING has supported approximately 280,000 scientists in the course of the project. The EBI website alone, by the end of the project, was seeing over 3 million hits per day.
The R&D work within SLING has ensured that the data and services remain state-of-the-art.
A key aspect of the project has been training in the use of the services provided under the contract. In the course of the project, SLING has provided such training at meetings and through well-attended training workshops throughout Europe. We estimate that 1,100 PhD students, postdocs, staff scientists, Principal Investigators, industrial researchers, undergraduates and Masters students have been exposed to SLING training activities via 33 Roadshows throughout the project.
Whilst the core activities are in the public domain, the information is crucial to the commercial sector. Specific industry-targeted activities have provided discussion forums and training. Particular attention has been given to the importance of the information in the field of patenting, with the European Patent Office working on dissemination activities to promote best practice throughout their customer community.
The benefits persist beyond the end of the project, as all of the knowledge and materials generated will continue to be available for the foreseeable future.
The societal benefit of the project is thus of a magnitude almost impossible to measure, as all life science related endeavour depends on its output and will continue to do so.
List of Websites:
Public website address: http://www.sling-fp7.org/
Relevant contact details (Nov 2012):
European Bioinformatics Institute (EBI)
Graham Cameron (until end of Oct 2012)
Rolf Apweiler (from Nov 2012)
Tel: +44 (0)1223 494 435
Joint Associate Director
EMBL Outstation - Hinxton
European Bioinformatics Institute
Wellcome Trust Genome Campus
EMBL Grants Office
EMBL Budget Office
EBI Grants Office
Technical University of Braunschweig (TUBS)
European Patent Office (EPO)
Swiss Institute of Bioinformatics