Final Report Summary - PSIMEX (Proteomics Standards International Molecular Exchange - Systematic Capture of Published Molecular Interaction Data)
Protein interaction data are generated by many different methodologies, and manuscripts describing such data are scattered across a broad spectrum of biological publications. Collating this information is the work of the many interaction databases in existence today. Most of the major curated databases have worked together since 2002 to make their content available in common data formats, annotated with the same controlled vocabularies, in order to maximise data accessibility and usefulness to the scientific community. Previously, the selection of publications for entry into their respective databases was uncontrolled, often resulting in redundant selection and repetitive curation. In order both to optimise the service these databases were giving the user community and to give maximum value for money to the funding agencies, it was agreed that curation activities should be pro-actively managed, in order to present to the user a single, non-redundant dataset consistently annotated to a uniformly high standard. These discussions led to the formation of the International Molecular Exchange (IMEx) Consortium [1]. The aim of the PSIMEx grant was to enable the transition of the IMEx Consortium from exploratory to full production mode, to build an appropriate infrastructure to enable curation activities to be synchronised and to ensure curation standards were consistent across multiple distinct data resources. A mechanism had to be developed by which new partner databases, without a mature existing infrastructure, could join the consortium, and agreements and procedures had to be put in place to guarantee data was not lost if a partner ceased curation activities or left the consortium.
Further to this, support was given to the standards body representing molecular interactions, the Molecular Interaction work-group of the Human Proteome Organisation Proteomics Standards Initiative (HUPO-PSI) [2] for the further development of data interchange standards required to meet evolving techniques and technologies.
As a result of the PSIMEx grant, as of 31st August 2013 the Consortium consists of 10 databases serving 284,402 interactions available from 8,067 publications. The IMEx Central database has been developed to ensure that redundant curation does not take place, and to issue unique identifiers to each publication. A sophisticated web-based curation tool has been built, which allows new resources to perform IMEx-level curation into an instance of the IntAct database [3], either that maintained at the EBI or one locally installed by the resource. Detailed curation rules have been agreed and published, and quality control measures, both manual checks and a validation tool containing purpose-specific rules, are in place to ensure they are adhered to. All data can be accessed at a central website or locally at the member database websites. Each partner maintains its own dataset, provided through the PSICQUIC web service [4]. This web service, and the underlying tab-delimited data format, have been substantially improved by the HUPO-PSI MI group during the lifetime of the grant, to enable a fuller representation of IMEx data. All data and software have been released and are freely available to the scientific community for use under the Creative Commons Attribution License.
Finally, the PSIMEx grant funded a series of 9 workshops to promote the activities of the IMEx Consortium and teach attendees, mainly PhD students and post-docs, how to use the data in large-scale network analysis. PSIMEx-funded staff taught similar content at a further 28 EBI roadshows. Online teaching modules were made available and two tutorials were published in paper-based format in peer-reviewed publications. An instructional video was released to show new users how to use the website. An active publication strategy has also served to make known the work of the IMEx Consortium.
Project Context and Objectives:
Proteins do not operate in isolation within the cellular environment, but instead form a network of interactions with other biomolecules, existing in both stable and transient complexes with other proteins, small molecules, lipids, carbohydrates and nucleic acids. Molecular interactions are the building blocks from which pathway and computational models of the cell are created. The study of protein interactions is essential to modern Systems Biology in that they are used to add biological context to large ‘Omics experiments, such as proteomic, micro-array/RNA-Seq and GWAS studies. The output of such experiments is a long list of gene/transcript/protein identifiers, and one of the more common subsequent analyses is to map these identifiers onto large networks of interacting proteins, specifically to look for clusters of interacting proteins within the experimental dataset. Such clusters are often due to the molecules participating in the same biological process or pathway, or being part of the same macromolecular complex. Research scientists working at the single molecule level also have an interest in the details of protein interactions, as to fully understand the biology of a single protein within the cellular environment it is necessary to know all the molecules with which it interacts, how each interaction is made and the functional consequence of each.
Protein interaction data are generated by many different techniques, including protein complementation assays such as yeast 2-hybrid, affinity techniques such as coimmunoprecipitation and pull-downs, or biophysical methods such as X-ray crystallography or NMR. All of the different methods used have their strengths and weaknesses, and researchers need to be able to access all aspects of an interaction experiment in order to critically assess the quality of the data it has generated. Protein interaction data are generated by many different laboratories around the world, with the volume of data ranging from many thousands of interactions produced by high-throughput methodologies to a single crystal structure. Collating these scattered data into central resources is the role of curated protein interaction databases. Many different protein interaction databases exist, but only a few capture experimental data either via literature curation or through direct deposition of data by bench scientists. Many of these databases are small resources which limit their curation activities to a specific area of biology. The loss of data to the public domain when any of these resources ceases operation has been an area of concern, particularly following the closure of the BIND database in 2006.
Prior to 2002, interaction databases existed in isolation and exported their data in proprietary formats. In 2002, the MI work group of the HUPO-PSI was formed and published common interchange standards, which were rapidly adopted by most interaction databases and made merging of information from different resources much easier for the user. However, curation standards were still variable and the selection of which papers were to be curated was made separately by each database, with the inevitable result that high-impact papers were repeatedly and redundantly entered into multiple resources. The IMEx Consortium (www.imexconsortium.org) was created to address these issues [1] and had already been in existence as a concept since 2004. Initial steps had been taken in aligning curation procedures and synchronising curation at the journal level. However, moving the concept of a single, unified dataset from a theoretical ‘nice-to-have’ to an actuality was proving difficult to achieve. The PSIMEx grant gave the consortium the opportunity to realise this ambition and to involve the work of other stakeholders – data producers, instrument manufacturers and data users – in the process.
The objectives of the grant were:
• to facilitate the transition of the International Molecular Exchange consortium from explorative to full production mode;
• to co-ordinate, unify, and quality control curation standards of the IMEx partner databases;
• to maintain the PSI-MI standard and to update it in reaction to/anticipation of new scientific technologies and methodologies;
• to promote the development of high quality analysis tools for the exploitation of PSI-MI formatted data;
• to encourage users to directly submit molecular interaction data to IMEx partner databases;
• to develop efficient data flow concepts for data deposition as part of the publication process;
• to raise awareness of data management requirements as a normal part of research activities;
• to benefit from and co-ordinate with groups developing related standards, in particular BioPAX, SBML, and MAGE;
• to ensure optimal exploitation of PSI-MI data in third party databases, such as Reactome, UniProt, or SGD.
Project Results:
The IMEx Consortium
In March 2009, at the start of the grant period, 5 databases were involved in the IMEx Consortium as founder members:
IntAct (EMBL-EBI) [3]
MINT (U. Rome) [5]
DIP (UCLA) [6]
MatrixDB (U. Lyon) [7]
MIPS (HGMU) [8]
These 5 databases had previously been involved in establishing data standards through the HUPO-PSI since 2002 (MatrixDB 2006) and had already signed a memorandum of understanding committing them to jointly manage curation activities. Initial discussions aimed at aligning the curation rules of the five databases had taken place over a series of meetings and an initial plan to prevent redundant curation had been put in place. However, no workable plan for data exchange had been established.
At the start-up meeting of PSIMEx, held in Turku in 2009 [9], the terms and conditions of IMEx Membership were established and a formal document drawn up, outlining what would be expected of current and prospective members (Deliverable D1.1). Key to the agreement was that data should remain with the IMEx Consortium should a member database close, or withdraw from the Consortium. Should this occur, the IMEx data of that database would be transferred to another member for long-term maintenance and to ensure it remains in the public domain. In addition, the conditions a prospective member had to meet in order to apply for membership were agreed, and the voting mechanism by which their membership would be decided upon was approved.
The first additional member, the Microbial Proteins Interaction database (MPIDB) [10], J. Craig Venter Institute, applied for membership at Turku. Funding had been ring-fenced for new members as part of the PSIMEx grant. MPIDB subsequently requested to be formally included on the grant and this was applied for under Amendment 1 of the grant. In the same Amendment, the EC were alerted to the move of Beneficiary 9, BindingDB [11] from the University of Maryland Biotechnology Institute, USA to University of California San Diego. Their costs were transferred to Beneficiary 3, University of California Los Angeles, and subsequently managed by internal transfer.
In 2010, the Consortium gained 3 new members – I2D (http://ophid.utoronto.ca/ophidv2.204/) [12], Molecular Connections (www.molecularconnections.com) and InnateDB (www.innatedb.ca/) [13]. It was agreed that these, and subsequent, new members would not be added to the grant but rather that the remaining funds specified for the use of new members would be managed by EMBL and used to fund travel for representatives of these resources to meetings and workshops. MIPS/HGMU resigned from both the IMEx Consortium and the PSIMEx grant. Amendment 2 formalised this with the EC and outstanding funds were returned to EMBL. UniProtKB (www.uniprot.org) [14] joined the IMEx Consortium in 2012. At this point the Consortium Terms and Conditions of membership were revised to enable groups wishing to contribute records to the IMEx dataset to apply for membership provided they could partner with an existing database that would be willing to provide a curation environment and host the curated data. The final group to join the Consortium during the period of the PSIMEx grant was MBInfo (www.mechanobio.info/, University of Singapore), giving a total of 10 resources contributing to the IMEx dataset. It is satisfying to note that the recruitment of new members will not end with the PSIMEx grant, with a new resource (University College London Cardiovascular Gene Annotation group) currently applying for membership.
The Consortium meets annually on a formal basis. In the interests of cost saving and minimising unnecessary travel, this is usually immediately after the annual Spring workshop of the HUPO-PSI and at the same venue, with the PSIMEx grant subsidising both meetings. Much of the standards work has been concentrated on the HUPO-PSI meetings, which have a broader attendance from data users and data producers.
In the lifetime of the grant, the HUPO-PSI meetings have been held at:
2009 Turku, Finland [9]
2010 Seoul, South Korea [15]*
2011 Heidelberg, Germany [16]
2012 San Diego, US [17]
2013 Liverpool, UK [18]
*In 2010 the IMEx meeting was held in Rome, Italy, because many participants were not attending the Seoul workshop. The Rome meeting was unfortunately disrupted by the effects of the Icelandic volcano on air traffic, but those who could not get flights to mainland Europe attended via Webex. Reports of the HUPO-PSI workshops have been published by PROTEOMICS (Beneficiary 6) [9, 15-18]; the minutes of the IMEx meetings are available on the PSIMEx website (password protected).
Data Curation
At the start of the PSIMEx grant, the existing members had agreed to ensure that redundant curation of the same paper did not occur; this was achieved by each partner fully covering the curation of a distinct set of journals. This simple data management model was followed throughout the early period of the grant, with efforts being concentrated on ensuring that entries were consistently annotated to a single set of rules (Deliverable 3.1).
1. The IMEx Curation Rules
The IMEx curation rules were established during a series of workshops, at which procedures were discussed, papers were co-curated and issues identified between meetings deliberated as a group. As a result of these meetings, a joint IMEx curation manual has been produced and made publicly available (http://www.imexconsortium.org/sites/imexconsortium.org/files/documents/imex_curation_rules_01_12.pdf). All members are expected to adhere to these rules as a minimal requirement for curation, but are free to add more stringent procedures locally.
2. IMEx Central
To manage data curation across multiple resources in the long term, the IMEx Central database (https://imexcentral.org/icentral/) has been established (Deliverable 5.2/5.3). IMEx Central implements two functions essential to collaborative curation of the interaction data by the members of the IMEx Consortium – publication tracking and accession number assignment. Both functions are available through a web-based graphical user interface or a set of SOAP-based Web services.
The publication tracker, developed at DIP UCLA (Beneficiary 3), allows IMEx partners to avoid redundant curation of publications, both at the pre- and post-publication stage. For each curated publication, IMEx partners enter the PubMed identifier (or internal journal identifier for manuscripts in the pre-publication stage) and other relevant metadata. The public web-based interface allows users to quickly see the curation status of a publication, or to request IMEx curation of a publication (set) not yet curated. A shared accession number system (IMEx key assigner) for molecular interaction datasets provides a consistent reference system to the end user, similar to the DOI system shared by publishers. It generates, upon request, IMEx-wide interaction dataset identifiers that constitute the foundation of the tracking and search system for IMEx curated interaction records.
Fig. 1 Users may request curation of a specific publication through IMEx Central
Throughout the life-time of the project, IMEx Central has been updated and re-released with improvements to the interface and search facilities. A Java IMExCentral Web Service Client has been developed by EMBL and made available to all partners (http://code.google.com/p/intact/wiki/ImexCentralClient). Direct access to the service has been built into the web-based curation tool developed under the PSIMEx grant (see later); other groups either register papers in advance of curation or request IMEx numbers as part of the release process. A record page and record editor layout have been implemented, and user accounts have been customised to enable opt-in to a publication watch list, automatic watch list generation and email notifications. Help pages have also been added. It was agreed that IMEx Central should now be integrated into the IMEx webpages and will become the main route by which users request curation of specific papers.
3. The IMEx web-based editorial tool
When the PSIMEx grant was first written, the intention was to install a public IntAct-based, hosted PSI-MI 2.5 compatible curation and data analysis system at Vital-IT (Beneficiary 4) at the Swiss Institute of Bioinformatics. Further to this, a quick curation interface would be developed, allowing curators to efficiently review the results of an automated text mining system for protein interactions, and to provide the results in PSI-MI 2.5 tabular format. However, following the results of the BioCreative III text-mining competition [19], in which relatively poor precision and recall were observed for the results of text-mining for protein interactions, it was agreed to refocus efforts on the development of a sophisticated editorial tool for manual curation (Deliverable 5.2/5.3). The editor is designed both for full IMEx-level curation, which requires the capture of full experimental data and a detailed description of the constructs used, and for the simpler MIMIx-level [20] curation. The tool is optimally designed to communicate with a version of the open-source IntAct molecular interaction database but could also be converted to work independently from an underlying database, and directly generate PSI-MI XML and MITAB files [20] should the use-case arise.
Fig. 2 The Publication-level view of the web-based curation tool
The editor is a web application written in Java. It uses JavaServer Faces 2.0 (JSF), a component-based user-interface framework for Java web applications, and the PrimeFaces open-source JSF component suite to provide visual components. All code is made publicly available on Google Code (https://code.google.com/p/intact/wiki/Editor) along with installation instructions and a user guide.
The editor, in conjunction with the version of the IntAct database housed at the EBI, is currently being used by several external groups to fulfil a variety of needs:
1. Data providers who do not have their own interaction database or editorial tool (UniProt, Molecular Connections, MBInfo (PSIMEx stakeholders), Cardiovascular Gene Annotation group, UCL)
2. Data providers with an interaction database but no editorial tool (I2D (PSIMEx stakeholder), MatrixDB (Beneficiary 14))
3. Data providers with an interaction database and a basic editorial tool who cannot achieve IMEx-level curation with their own editorial tool (InnateDB (PSIMEx stakeholder))
The editor has an Institute Manager, which enables each of these user groups to be separately identified. External contributors can select to either have a custom download of their own (and additional) data in a defined format or take the XML files from the website for import into their own resource. All IMEx members are able to request IMEx IDs which are associated with their institute/resource, and will be displayed as part of the non-redundant IMEx dataset associated to that institute/resource. There are currently 43 active curator accounts, with data being contributed from 3 continents.
As stated above, the editor has a direct link to the IMEx Central database designed by DIP/UCLA (Beneficiary 3). This enables curators to access the database at the press of a button, and be informed whether the publication they have selected for addition to the database has already been curated by another IMEx partner, and if not, to immediately have an IMEx ID assigned to it linked to the host institute of that curator. The IMEx assigner only becomes operational when the curator actively selects the curation depth ‘IMEx’ for a record – if the curation depth is set to ‘MIMIx’ or is not selected at all, the IMEx Assigner is not visible to the curator and will not operate, thus ensuring all records assigned an IMEx ID are of the required standard of curation. Direct submissions, which are entered into the database in advance of publication and therefore have not been PubMed indexed, are assigned an IMEx ID in IMExCentral which can be associated with an internal identifier generated by the editor enabling an automatic update of the IMExCentral entry when the PubMed ID is eventually assigned.
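The gating behaviour described above can be illustrated with a minimal sketch. It is a toy in-memory model written for this summary, not the IMEx Central web service or its client: the class names, the `TOY-` identifier format and the method names are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PublicationRecord:
    identifier: str                        # PubMed ID, or an internal ID before publication
    institute: str                         # host institute of the curator
    curation_depth: Optional[str] = None   # "IMEx", "MIMIx", or None if not yet selected
    imex_id: Optional[str] = None


@dataclass
class ToyRegistry:
    """In-memory stand-in for a central registry (hypothetical, not the real API)."""
    records: Dict[str, PublicationRecord] = field(default_factory=dict)
    counter: int = 0

    def request_imex_id(self, record: PublicationRecord) -> Optional[str]:
        # The assigner stays inactive unless the curator selected IMEx depth.
        if record.curation_depth != "IMEx":
            return None
        # A publication that is already registered is not assigned a second ID.
        if record.identifier in self.records:
            return None
        self.counter += 1
        record.imex_id = f"TOY-{self.counter}"  # illustrative identifier format only
        self.records[record.identifier] = record
        return record.imex_id

    def attach_pubmed_id(self, internal_id: str, pubmed_id: str) -> None:
        # Direct submissions are registered under an internal ID and re-keyed
        # once the PubMed ID becomes available.
        record = self.records.pop(internal_id)
        record.identifier = pubmed_id
        self.records[pubmed_id] = record
```

In this sketch a record set to MIMIx depth, or with no depth selected, never receives an identifier, and a pre-publication submission keeps the same identifier when its PubMed ID is attached later.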
A curation monitoring system allows both first-line curators and those tasked with checking entries before their final release to check the progress of a publication record through its pre-release life-cycle. Ownership of the record passes between the curator and checker at the press of a button and the status follows a simple colour coding system. The entry is randomly assigned a checker from a pool of senior curators. The proportion of records each curator receives can be controlled by an administrator such that the proportion can be reduced if, for example, a senior curator is absent due to illness or holiday.
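One way such an administrator-controlled random assignment could work is weighted random selection. The sketch below is an assumption about the approach, not the editor's actual code; the function name and the weight table are hypothetical.

```python
import random
from typing import Dict, Optional


def assign_checker(weights: Dict[str, float], curator: str,
                   rng: Optional[random.Random] = None) -> str:
    """Pick a checker at random, in proportion to administrator-set weights.

    `weights` maps each senior curator to a relative share of records.
    A weight of 0 removes a checker from the pool (for example during leave).
    """
    rng = rng or random.Random()
    # The checker must be a second person, so the record's curator is excluded.
    candidates = {name: w for name, w in weights.items()
                  if name != curator and w > 0}
    if not candidates:
        raise ValueError("no checker available for this record")
    names = list(candidates)
    return rng.choices(names, weights=[candidates[n] for n in names], k=1)[0]
```

For example, with weights `{"alice": 1.0, "bob": 0.0, "carol": 2.0}` and a record curated by `alice`, only `carol` can be selected, because `bob` has been given a share of zero.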
Data Flow
When the PSIMEx grant was initially conceived, it was intended that there would be a regular exchange of interaction records, following the data exchange model established by the nucleotide sequence databases (EMBL/GenBank/DDBJ) and the structural biology databases, wwPDB. However, this proved not to be practicable for detailed protein interaction records, which need constant update to maintain synchronicity with the underlying protein sequence reference database (UniProtKB). It proved much simpler for each contributing database to make their data available using the PSICQUIC web service [4], effectively installing a distributed database system. Delivery of the IMEx set of interaction records to the IMEx partners and individual member database websites is achieved through a tagging process. Only IMEx partners may use the IMEx tag, and only records presented in a registered PSICQUIC service, tagged as an IMEx record and with an IMEx accession number, can be viewed and downloaded on the IMEx website. Each IMEx partner presents their data via a PSICQUIC server, and a PSICQUIC client can query all partners for IMEx data matching a given query, providing an up-to-date view of all relevant data from all IMEx partners. In addition to ensuring that each resource was supplying the most up-to-date version of an interaction record, this also made it much simpler for partner databases and external resources to build IMEx record search and retrieval capabilities into their own websites. For those resources without their own database, and that are currently using the web-based curation tool to annotate directly into IntAct, a distinct PSICQUIC service has been established for each, so each receives full credit for their work. This is managed through the Institute Manager (see above), which links the work of each curator to a specific institute.
Once a search has been made, the user may download the data directly from the website (www.imexconsortium.org) or access the original XML files at the host database website.
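The distributed query and tagging model described above can be sketched as follows. This is a simplified offline illustration: the partner services are modelled as plain Python callables rather than real PSICQUIC endpoints, and the `Interaction` fields, `make_service` and `query_imex` are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Interaction:
    interactor_a: str
    interactor_b: str
    source: str                      # partner database providing the record
    imex_tagged: bool = False        # record carries the IMEx curation tag
    imex_id: Optional[str] = None    # IMEx accession number, if assigned


# A "service" takes a query string and returns matching records.
Service = Callable[[str], Iterable[Interaction]]


def make_service(records: List[Interaction]) -> Service:
    """Build a toy service that matches a query against either interactor."""
    def search(query: str) -> List[Interaction]:
        return [r for r in records if query in (r.interactor_a, r.interactor_b)]
    return search


def query_imex(services: Dict[str, Service], query: str) -> List[Interaction]:
    """Query every registered service and keep only IMEx records."""
    hits: List[Interaction] = []
    for name in sorted(services):
        for rec in services[name](query):
            # Only tagged records with an accession number belong to the IMEx set.
            if rec.imex_tagged and rec.imex_id:
                hits.append(rec)
    return hits
```

Because each partner answers the query from its own current data, the merged result reflects the latest version of every record without any bulk exchange of files between databases.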
The first IMEx records were made available in February 2010, consisting of 28,500 records from 1,365 publications made public by 2 databases. By the end of the grant (Aug 2013), the number of records had risen to 284,402 interactions available from 8,067 publications originating from 10 data resources.
Fig. 3A Increase in data flow throughout the lifetime of the grant measured by publications curated
Fig. 3B Increase in data flow throughout the lifetime of the grant measured by interactions curated
The Project Website
The project currently has 2 websites:
a. The IMEx Consortium website (www.imexconsortium.org)
This is the main point of focus for the dissemination of the data to the public and also contains public documentation such as the curation manual, announcement of training courses, news items etc. Data search and download capabilities have been built into this website.
Fig. 4 The IMEx Consortium website
b. The PSIMEx website (www.psimex.org)
This currently holds all the documentation pertaining directly to the PSIMEx grant such as minutes, deliverable and periodic reports. It is not intended to maintain this site significantly beyond the period of the grant, but key documents will be transferred to a password protected section of the IMEx Consortium website and be available to existing and future consortium members.
Both domain names were purchased for an initial 10-year period with an option to renew. Both sites run on the open-source Drupal content management system, which allowed the Consortium to establish the websites rapidly while providing the flexibility to customise them to meet user requirements as the collaboration matures.
As described above, IMEx uses the distributed PSICQUIC system as the basis for IMEx data dissemination to minimize the data exchange overhead. The initial release of PSICQUIC supported only the very limited set of 15 fields of PSI-MITAB 2.5 which represented a simplistic description of molecular-interaction data and failed to fully represent the richness of the IMEx curated records. In this reporting period, efforts initially concentrated on the development of a new PSICQUIC specification supporting the extended PSI-MITAB 2.6 and 2.7 formats [22], with 36 and 42 fields respectively. Specifically fields such as features on participants, for example binding sites or interacting residues, and additional structured annotation added to records such as details of agonists can now be accessed in IMEx records providing an improved IMEx service via the website. The new specification encompasses extensions to the MIQL query language and a completely new implementation of the PSICQUIC reference server. Nine of the ten current IMEx members have updated their PSICQUIC server to implement the new specification increasing the information content of the IMEx records which can be downloaded from the IMEx Consortium website.
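Since the three MITAB versions mentioned above differ in their number of tab-separated fields (15, 36 and 42), a consumer can tell them apart by counting columns. The sketch below shows this under the assumption that each line is a single well-formed record; the function name is hypothetical and the check does not validate the content of any field.

```python
# Field counts per MITAB version, as given in the text above.
MITAB_FIELD_COUNTS = {15: "2.5", 36: "2.6", 42: "2.7"}


def mitab_version(line: str) -> str:
    """Infer the MITAB version of one record from its number of columns."""
    # Strip only the line terminator so that empty trailing columns still count.
    n_fields = len(line.rstrip("\r\n").split("\t"))
    try:
        return MITAB_FIELD_COUNTS[n_fields]
    except KeyError:
        raise ValueError(f"unexpected number of MITAB fields: {n_fields}") from None
```

A client written against the 15-field format can use a check like this to decide whether the richer fields, such as participant features, are available before trying to read them.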
Quality Control
The IMEx dataset is becoming known as much for the quality of the curation as for the detail within the records or the non-redundancy of the information, so it is critical for the Consortium that this is maintained. Quality control is achieved by several different mechanisms:
• Cross checking of entries between databases
• Automated rule-based entry checking
• Cross training of curators
1. Cross checking of entries between databases
Cross-checking of entries between databases has become possible using the new curation interface created as part of the PSIMEx grant (Deliverables D4.1.1-4.1.2, 4.3). Eight partners now use this interface (IntAct, MINT, MatrixDB, I2D, InnateDB, Molecular Connections, UniProt and MBInfo). As an integral part of the life-cycle of a publication during the curation process, each record must be checked by a second curator prior to release. This second curator can be specifically assigned (for example, new curators are usually assigned to a specific senior curator as part of the training process), or randomly assigned either from their own data resource or from an external resource. Entry checking has enabled us to identify subtle differences in curation style between the different databases and to address these through changes to the curation rules or to the IMEx rules in the PSI validator (see below) [23].
Fig. 5 Cross-checking of entries through the web-based editorial tool
All authors whose papers are curated using the PSIMEx-funded curation interface are routinely sent an email requesting that they validate their data in the database. Response rates are low, but the responses received often consist of requests that additional data be added to the set or that further papers be curated, suggesting that the authors are satisfied with the representation of their data by the data curators.
2. Automated rule-based entry checking
The PSI validator (Deliverable 3.2, http://www.ebi.ac.uk/intact/validator) executes a set of rules based on the PSI-MI ontology to check the validity of IMEx curation in XML files [23]. This not only provides an additional check that IMEx curation rules are being adhered to by all members but also ensures that XML files are consistent between databases, easing the user experience. It can check at 5 possible levels, with each higher level including all the checks of the levels below it:
• XML syntax
• Usage of controlled-vocabulary defined in the PSI-MI ontology
• PSI-MI basic checks
• MIMIx-compliance
• IMEx-compliance
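The cumulative nature of these five levels can be sketched as an ordered list in which validating at one level implies running every level before it. This is an illustration only, not the PSI validator's implementation; the per-level checker functions are hypothetical placeholders supplied by the caller.

```python
from typing import Callable, Dict, List, Optional

# Validation levels in increasing order of stringency, as listed above.
LEVELS: List[str] = [
    "XML syntax",
    "controlled vocabulary usage",
    "PSI-MI basic checks",
    "MIMIx compliance",
    "IMEx compliance",
]


def checks_for(level: str) -> List[str]:
    """Return every level that must pass when validating at `level`."""
    return LEVELS[: LEVELS.index(level) + 1]


def highest_level_passed(document: str,
                         checkers: Dict[str, Callable[[str], bool]]) -> Optional[str]:
    """Run the checkers in order and report the last level that passed."""
    passed: Optional[str] = None
    for level in LEVELS:
        if not checkers[level](document):
            break  # higher levels include this one, so they cannot pass either
        passed = level
    return passed
```

Under this model, a file that fails the MIMIx checks is reported as passing at most the PSI-MI basic checks, and the IMEx rules are not evaluated for it.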
IMEx rules are agreed by discussion between curators and reflected in a concomitant update of the IMEx curation manual. Plans for the near future include incorporating the syntax checker directly into the PSIMEx-funded editorial tool, such that curators are informed immediately when entries fail any of these checks – currently issues are only raised during release builds and are manually fixed as part of the release process.
3. Cross training of curators
The communication between curators has been most effective in ensuring consistent working practices. This has been achieved on a daily basis via a common email list, which enables the discussion of problem cases or required rule changes, through the curator visits from one institute to another and through curator workshops, the latter two being funded by the PSIMEx grant.
Standards Development
Integral to all this work has been the use and development of the HUPO-PSI data standards. The further development of those standards to expedite the release of the IMEx dataset was a key part of the PSIMEx grant. Under the terms of the IMEx agreement, and to enable the use of the PSICQUIC web service to present the IMEx dataset to the user community, all participating databases must agree to use the agreed version of the PSI-MI data standard. However, the techniques and technologies utilised to generate interaction data become more sophisticated with time, and standards and formats must progress in parallel. The PSI-MI suite of data resources includes:
• An XML format in which to represent data
• A simpler, tab-delimited format
• A controlled vocabulary with which to annotate the format.
• The PSICQUIC web service for users to access and query the data
The simplest method for responding to new techniques is to add terms to the controlled vocabulary, the richest branches of which describe the interaction detection method and participant detection methods. The IMEx curators are both the heaviest users of, and the most active contributors to, the continued development of the controlled vocabulary, with over 100 new terms requested over the life-time of the grant. New branches such as ‘curation quality’ and ‘curation content’ were also added to enable IMEx records to be differentiated from other levels of curation.
IMEx records are richly annotated with a high degree of detail. As previously described, the PSICQUIC web service is used to make IMEx records publicly available as a single, non-redundant set. PSICQUIC is based on the MITAB format, and not PSI-MI XML. At the start of the grant, when the process was first established, PSICQUIC was based on MITAB2.5 in which the data is described in a very limited set of 15 fields. In order to present a better representation of the IMEx data to the user community, a new PSICQUIC specification was developed based on the much richer, 42 field MITAB2.7 format. A prototype version of PSICQUIC based on XML has also been developed. The system requires extensive testing and statistics on query time and indexing have yet to be gathered but the ability of XML to deal with multi-protein complexes is an obvious advantage of this system. The code is available for testing and input (https://code.google.com/p/psicquic/source/browse/trunk/psicquic-solr-server).
Under the PSIMEx grant, provision was made for the further development of the existing PSI-MI XML standard (2.5.4) to a version 3.0. At each HUPO-PSI workshop, the usage of PSI-MI XML has been reviewed and new use cases examined. In each case, it was found that version 2.5.4 was capable of meeting requirements, so resources were concentrated on the further development of PSICQUIC. In April 2013, new cases were identified which could not be met by the current standard [17]. Specifically, these new cases include:
• Cooperative and allosteric interactions – binding of one molecule influences the subsequent binding of a second [24]
• Abstracted information which is not directly linked to an experiment, for example an abstracted protein complex
• Protein sets as interactors, for example in affinity mass spectrometry experiments in which an identified peptide can be matched to multiple proteins
• The removal of elements of the existing schema which are unused, for example the experiment-reference
The decision was made that it was time to develop version 3.0. This is not an immediate process: the development of a community standard requires a long period of consultation before development and testing can begin, so it will not be achieved in the lifetime of the grant. However, an initial proposal document has been written and the consultation initiated (https://docs.google.com/document/d/1b4lrUI7PH5uU2rcLMGTZtCf3k8bfjAsqlQyGKKuJN7U/edit).
This work will be concluded beyond the lifetime of this grant.
Additional activities
1. Implementation of Standard Formats as an Instrumentation read-out
One possible scheme for easing the route from experimental read-out to interaction database is to have one or more of the PSI-MI formats be the direct read-out from experimental instrumentation. This was investigated for Biacore systems from GE Healthcare (Beneficiary 7), which use surface plasmon resonance (SPR) based technology for studying biomolecular interactions in real time. This methodology was particularly appropriate for the exercise as it is one of the very few molecular interaction techniques with an objective numeric read-out from an analytical instrument; in most experiments the end point is a relatively subjective identification of a participant, or list of participants. Biacore instruments already support data export in XML format. However, much of the information required by the PSI-MI format is missing. For example, for a record to be IMEx-compatible, the binding partners of the interaction must be named unambiguously by referencing an entry in a public database such as UniProtKB. In Biacore system output, this metadata is entered as free text by the user when setting up the experiment and would require conversion from free text. The evaluation looked at 3 possibilities:
1. a stand-alone user interface where the user enters the result manually
2. a stand-alone user-interface communicating with the Biacore evaluation software
3. PSI-MI XML2.5 becomes an integrated part of Biacore evaluation software
GE Healthcare had already invested substantially in their own software development for Biacore systems and was not able to embark on any route which required a substantial change, and everyone agreed that alternative 3 was not an option. It was also important to consider that the data the researcher wishes to have as an end point (e.g. rate and affinity constants, ka, kd and KD) is frequently based on an average of several experiments. Such an averaged value does not correspond to the output of an individual Biacore run, which is generated from a single experiment. This clearly limits the value of a formatted export direct from the instrumentation. For this reason, a standalone user interface where results can be entered manually, be it an average of several experiments or a single value, would be the most flexible solution. Such a tool would also be independent of the type of Biacore system that was used to generate the data. The investigation then went on to produce a list of SPR-related parameters to be submitted into the database, as discussed with an experienced Biacore user and PSIMEx partner, Prof. Sylvie Ricard-Blum (Beneficiary 13), CNRS-University of Lyon, France. A table mapping Biacore data to MIMIx and IMEx level data and a mock-up of a potential graphical user interface were then produced. At this point, the report was sent to GE Healthcare; to the knowledge of those involved with the PSIMEx grant, there has been no further commitment by the company on these proposals.
Fig. 6 The proposed Biacore-PSI user-interface
2. Common Interaction Scoring System
Molecular interaction data are generated by many different methodologies, all of which have their strengths and weaknesses, and all of which are capable of generating false positive data as well as ‘true’ interactions. It is therefore common practice to score interaction evidences, to enable the filtering out of false positive interactions, which should score poorly in comparison to biologically valid interactions. There are many such methods available; some rely on comparing the interaction data with orthogonal data (for example with the corresponding gene annotation of a pair of apparently interacting proteins), others on an assessment of the experimental data, or on text-mining of the number of publications in which the pair have appeared in the same sentence. It is, therefore, very difficult to adopt a common scoring schema that combines all these different approaches. For this reason, PSISCORE uses a decentralized setting, where individual scoring servers apply their specific scoring methods for assessing diverse biological and methodological aspects of interaction data. PSISCORE is an open-source, distributed system [4]. PSISCOREweb (http://psiscore.bioinf.mpi-inf.mpg.de/) is a simple web-based PSISCORE client. The start and end point of a PSISCORE use case is a set of molecular interactions in a HUPO-PSI-defined file format (i.e. MITAB or PSI-MI XML 2.5). A PSISCORE client such as PSISCOREweb sends the MI file to multiple scoring servers. After the requested scoring computations have been performed, all the calculated scores are added to a downloadable output file.
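The decentralized pattern described above, in which one client sends the same interaction set to several independent scorers and gathers every returned score, can be sketched in a few lines of Python. This is an illustration only: the real PSISCORE servers are remote services, whereas here two local functions stand in for them, and the function names, the toy scoring rules and the interactor labels are all hypothetical.

```python
# Toy stand-ins for remote scoring servers (hypothetical; a real PSISCORE
# server would be contacted over the network and apply its own method).
def constant_server(interactions):
    """Score every submitted pair with the same value."""
    return {pair: 0.5 for pair in interactions}


# Made-up lookup table used by the second toy server.
KNOWN_PAIRS = {("P1", "P2"): 0.9}


def lookup_server(interactions):
    """Score only the pairs this server knows about; skip the rest."""
    return {pair: KNOWN_PAIRS[pair] for pair in interactions if pair in KNOWN_PAIRS}


def collect_scores(interactions, servers):
    """Send the interaction list to each server and merge the scores.

    Returns {pair: {server_name: score}}; a server that cannot score a
    pair simply contributes no entry for it.
    """
    results = {pair: {} for pair in interactions}
    for server_name, score_fn in servers.items():
        for pair, score in score_fn(interactions).items():
            if pair in results:  # ignore scores for pairs we did not send
                results[pair][server_name] = score
    return results
```

For instance, `collect_scores([("P1", "P2"), ("P3", "P4")], {"constant": constant_server, "lookup": lookup_server})` yields two scores for the first pair and one for the second, mirroring the way a client accumulates whatever each server is able to compute.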
One of the many advantages of all databases utilising the same data formats, and the same controlled vocabularies, is that scoring methods looking at the amount of experimental evidence in existence for two molecules interacting can easily be developed. The flexibility of the controlled vocabularies is such that this method can be used to include, or exclude, predicted and inferred interaction data. One such method is miscore, now routinely used by the IntAct database (Jimenez et al., in preparation). In this method, all interaction evidences for two molecules interacting are merged by using a predefined set of database identifiers and cross-references (mimerge). The merged interactions are then sub-scored on the methodology by which the evidence was generated, the type of interaction observed and the number of publications from which experimental data has been taken. By default, miscore presents a normalized score between 0 and 1 reflecting the reliability of the combined experimental evidence. This score is calculated as a weighted sum of the three different subscores, each of which is also represented by a score between 0 and 1. The importance of each variable in the main equation can be adjusted using a weight factor.
Fig. 7 The miscore system
The user is then presented with a set of interacting pairs of molecules, each with an interaction confidence score between 0 and 1, and can then choose a threshold above which they are willing to consider their interaction dataset as ‘True’.
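The weighted combination and the subsequent filtering step can be sketched as follows. This is not the miscore implementation: it only shows one way of combining three subscores in the range 0 to 1 into a single normalized score. Dividing by the sum of the weights is an assumption made here to keep the result between 0 and 1, the function and parameter names are hypothetical, and how each subscore is derived from the evidence is outside the scope of the sketch.

```python
def combined_score(method, interaction_type, publications, weights=(1.0, 1.0, 1.0)):
    """Combine three subscores (each in [0, 1]) into one score in [0, 1].

    Assumption for this sketch: the weighted sum is divided by the total
    weight so that the result stays normalized.
    """
    subscores = (method, interaction_type, publications)
    for value in subscores:
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"subscore out of range: {value}")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    weighted = sum(w * s for w, s in zip(weights, subscores))
    return weighted / sum(weights)


def filter_by_threshold(scored_pairs, threshold):
    """Keep the pairs whose score is at or above the chosen threshold."""
    return {pair: s for pair, s in scored_pairs.items() if s >= threshold}
```

With equal weights, `combined_score(1.0, 0.5, 0.0)` is the plain average of the three subscores; doubling the first weight, `weights=(2.0, 1.0, 1.0)`, pulls the result towards the methodology subscore. Whether the cut-off is inclusive is a choice made for this example.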
3. Protein identifier mapping system
One of the more time-consuming problems facing anyone working with interaction data is the issue of identifier mapping. Molecules can be identified in many ways; in the case of proteins, these include gene/protein names (with or without an accompanying indication of species) or accession numbers from any one of UniProt, RefSeq, ENA/GenBank/DDBJ, Ensembl, or model organism databases. Protein identifier mapping services exist, and an assessment of each was carried out to decide which to recommend for IMEx curation (Deliverable 4.4). The conclusion was that no one service could fulfil curator needs, but a combination of the UniProt ID mapping service (http://www.uniprot.org/?tab=mapping) and the Protein Identifier Cross-Reference Service (www.ebi.ac.uk/Tools/picr/?) [25] at the EMBL-EBI could deal with most use cases.
4. Relationship with journals
One important aspect of the PSIMEx grant was to improve the submission of molecular interaction data prior to publication of a manuscript, rather than databases being reliant on the less efficient process of archival curation of post-publication material. Two journals were involved with the PSIMEx grant and have not only supported us by publishing papers and meeting reports, but have also adapted their instructions to authors to encourage this process. For example, all Nature journals encourage authors to submit:
“For protein interaction data: IMEx consortium of databases including DIP, IntAct and MINT”. The online Nature journal, Molecular Systems Biology is particularly rigorous in following up on this. Other journals have been slow to follow suit, but the insistence on public domain data deposition by the UK Research Councils is starting to show positive benefits in this area.
Further to this, one Consortium member, MINT, produces structured digital abstracts (SDAs) which summarise, using database identifiers and predefined controlled vocabularies, the protein interactions reported in the manuscript [26]. The papers are sent to the MINT curators following acceptance, the interactions are entered into the database and the SDAs are computationally generated from that information.
Example of an SDA
Structured summary of protein interactions
PIAS2 physically interacts with RACK1 by two hybrid (View interaction)
RACK1 and PIAS2 colocalize by fluorescence microscopy (View interaction)
PIAS2 physically interacts with RACK1 by two hybrid pooling approach (View interaction)
PIAS2 and RACK1 colocalize by fluorescence microscopy (View interaction)
PIAS2 physically interacts with RACK1 by anti bait coimmunoprecipitation (View Interaction: 1, 2)
Data generated as part of the HUPO Human Proteome Project (HPP) must also be submitted to an IMEx database prior to publication before it can claim to be an HPP dataset, irrespective of the journal the group subsequently chooses to publish in.
References
1. Orchard S. et al. Nat Methods 2012, 9(4):345-350
2. Orchard S. et al. Biochim Biophys Acta 2013, DOI: 10.1016/j.bbapap.2013.03.011
3. Kerrien S. et al. Nucleic Acids Res 2012, 40(Database issue):D841-846
4. Aranda B. et al. Nat Methods 2011, 8(7):528-529
5. Licata L. et al. Nucleic Acids Res 2012, 40(Database issue):D857-861
6. Xenarios I. et al. Nucleic Acids Res 2002, 30:303-305
7. Chautard E. et al. Nucleic Acids Res 2011, 39(Database issue):D235-240
8. Mewes H. et al. Nucleic Acids Res 2011, 39(Database issue):D220-224
9. Orchard S. et al. Proteomics 2009, 9(19):4426-4428
10. Goll J. et al. Bioinformatics 2008, 24(15):1743-1744
11. Liu T. et al. Nucleic Acids Res 2007, 35(Database issue):D198-201
12. Niu Y. et al. Bioinformatics 2010, 26(1):111-119
13. Breuer K. et al. Nucleic Acids Res 2013, 41(Database issue):D1228-1233
14. UniProt Consortium. Nucleic Acids Res 2013, 41(Database issue):D43-47
15. Orchard S. et al. Proteomics 2010, 10(17):3062-3066
16. Orchard S. et al. Proteomics 2011, 11(22):4284-4290
17. Orchard S. et al. Proteomics 2012, 12(18):2767-2772
18. Orchard S. et al. Proteomics 2013, 13(20):2931-2937
19. Chatr-Aryamontri A. et al. BMC Bioinformatics 2011, 12 Suppl 8:S8
20. Orchard S. et al. Nat Biotechnol 2007, 25(8):894-898
21. Kerrien S. et al. BMC Biol 2007, 5:44
22. Del-Toro N. et al. Nucleic Acids Res 2013, 41(Web Server issue):W601-606
23. Montecchi-Palazzi L. et al. Proteomics 2009, 9(22):5112-5119
24. Van Roey K. et al. Database (Oxford) 2013, 2013:bat066
25. Wein S.P. et al. Nucleic Acids Res 2012, 40(Web Server issue):W276-278
26. Ceol A. et al. FEBS Lett 2008, 582(8):1171-1177
27. Orchard S. et al. Curr Protoc Protein Sci 2010, Chapter 25:Unit 25.3
28. Koh G. et al. J Proteome Res 2012, 11(4):2014-2031
29. Franceschini A. et al. Nucleic Acids Res 2013, 41(Database issue):D808-815
30. Razick S., Magklaras G., Donaldson I.M. BMC Bioinformatics 2008, 9:405
Potential Impact:
1. The Economic Impact
The main drivers behind the formation of the IMEx Consortium (www.imexconsortium.org) were to
• Provide the optimal service to the interaction data community by supplying a non-redundant set of data consistently annotated to the same high standards
• Give the respective funders of the various collaborating databases the best return on their investment, in that they would always be under-writing new curation and not the repeated collation of data which already exists in a similar resource.
The IMEx dataset is now contributed to by 10 databases, with an 11th currently applying for membership; only 4 of these have received direct PSIMEx funding, and a further 4 have received some travel funds as stakeholder members. The infrastructure and support from contributing resources built under the PSIMEx grant have proven a popular organisational model and will ensure that the Consortium exists well beyond the lifetime of the grant. Central to this have been:
1. The web-based editorial tool which, as of August 2013, is used by a total of 11 resources, 9 of which are already members of IMEx. The editor has enabled resources such as UniProtKB, which does not have an in-house interaction database, to contribute to the public domain interaction dataset. Additionally, specialised resources such as MatrixDB, which have only limited funding, do not have to spend valuable developer resource on producing their own editorial tool, but instead curate into IntAct and import the resulting records into their own local resource.
2. The development of IMEx Central to manage annotation across multiple resources. This enables the curators to immediately identify papers which have already been curated, reserved, or rejected as impossible to curate by other curators working at disparate resources. For the majority of curators, working with the PSIMEx editorial tool, this is an immediate process as the editorial tool communicates with IMEx Central every time a new paper is initiated. Databases that have not built the web service into their editor will normally perform this in batch mode, via the web interface.
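The coordination step performed through IMEx Central, checking whether a paper has already been curated, reserved or rejected before work on it begins, can be pictured with the small Python sketch below. It is an in-memory stand-in written purely for illustration: the class, method and status names are hypothetical and do not describe the actual IMEx Central web service or its interface.

```python
from enum import Enum


class PaperStatus(Enum):
    """Statuses mentioned in the text; the names are chosen for this sketch."""
    RESERVED = "reserved"
    CURATED = "curated"
    REJECTED = "rejected"


class PaperRegistry:
    """Hypothetical in-memory stand-in for a shared curation registry."""

    def __init__(self):
        self._records = {}  # publication id -> (database, status)

    def lookup(self, publication_id):
        """Return (database, status) for a paper, or None if unclaimed."""
        return self._records.get(publication_id)

    def reserve(self, publication_id, database):
        """Try to claim a paper; fail if any resource already holds it."""
        existing = self._records.get(publication_id)
        if existing is not None:
            return False, existing
        record = (database, PaperStatus.RESERVED)
        self._records[publication_id] = record
        return True, record

    def update(self, publication_id, database, status):
        """Let the owning database change the status of its own paper."""
        existing = self._records.get(publication_id)
        if existing is None or existing[0] != database:
            raise PermissionError("paper is not held by this database")
        self._records[publication_id] = (database, status)
```

In this sketch, a second resource calling `reserve` on an already claimed paper gets `False` together with the existing record, which is the point at which a curator would move on to a different publication.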
There is an obvious economic benefit to smaller databases, which are not only spared the expense of developing their own curation and data management pipelines but also receive training and quality control input from established resources such as IntAct. This enables them to concentrate their limited resources on curation, with the result that a non-redundant set of interactions from over 8000 publications has now been added into the public domain, and on the improvement of their own websites, which often serve a very specific user community. These include data resources such as MatrixDB, which specialises in the biology of the extracellular matrix, and MBInfo, an educational resource aimed at a deeper understanding of the cellular cytoskeleton.
2. Dissemination Activities
Dissemination activities have followed two main routes
a. Publication and conference presentation
b. Training workshops and elearning modules (WP6)
Publication and Conference Presentation
The seminal publication “Protein interaction data curation: the International Molecular Exchange (IMEx) consortium”, jointly authored by all the IMEx consortium databases, appeared in Nature Methods in April 2012 [1] and has already received 35 citations (Google Scholar, Aug 2013). However, prior to this, member databases had been describing their role within the IMEx Consortium in their respective publications; as these are well used resources, such publications are also well read and well cited (for example, the IntAct 2012 reference paper [3] with 205 Google Scholar citations and the MINT 2012 paper [5] with 94).
There have been many conference presentations (see below), with the work of the IMEx Consortium having regularly been described at HUPO Congress workshops, BioCurator Congresses, network analysis and data visualization meetings, and bioinformatics/computational biology conferences. Over this time period researchers directly or partially funded by this grant have spoken at over 50 international meetings and workshops.
Training workshops and elearning modules
A major part of the PSIMEx proposal was the use of training workshops and publication/elearning based courses to disseminate the work of the Consortium and ensure full utilisation of the IMEx dataset.
Nine workshops have been fully funded or subsidised by the PSIMEx grant
1. Course Title: Interactions & Pathways: towards a whole system perspective
Organisers: EMBL-EBI (Beneficiary 1)
Date: 15-18th June 2009
Venue: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton Cambridge
2. Course Title: PSIMEx Workshop: Interactions and Pathways
Organisers: EMBL-EBI (Beneficiary 1)
Date: 29th March-1st April 2010
Venue: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton Cambridge
3. Course Title: PSIMEx Workshop: Interactions and Pathways
Organisers: Centre National de la Recherche Scientifique, Université Lyon 1 (Beneficiary 13)
Date: 27-29th September 2010
Venue: Institut de Biologie et Chimie des Protéines, UMR 5086 CNRS - Université Lyon 1
4. Course Title: PSIMEx Workshop: Interactions and Pathways
Organisers: EMBL-EBI (Beneficiary 1)
Date: 29th March-1st April 2011
Venue: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton Cambridge
5. Course Title: BIOINFORMATICS AND NETWORK BIOLOGY: from genomic data to profiles, networks and pathways
Organisers: Centro de Investigación del Cancer (Beneficiary 10)
Date: 22-24th June 2011
Venue: Centro de Investigación del Cancer (CiC-IBMCC, CSIC/USAL), Salamanca, SPAIN
6. Course Title: PSIMEx Workshop: Interactions and Pathways
Organisers: Swiss Institute of Bioinformatics (Beneficiary 4)
Date: October 6-7, 2011
Venue: Lausanne, Switzerland
7. Course Title: PSIMEx Workshop: Networks and Pathways Bioinformatics for Biologists
Organisers: EMBL-EBI, Hinxton, Nr Cambridge, CB10 1SD, UK (Beneficiary 1)
Date: 16 - 18 May 2012
Venue: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton Cambridge
8. Course Title: Small/medium Enterprises Bioinformatics Forum, Barcelona
Organisers: EMBL-EBI (Beneficiary 1), Research Programme on Biomedical Informatics (GRIB) and Biocat, the cluster organization that coordinates biotechnology and medical technologies in Catalonia, Barcelona Biomedical Research Park, PRBB C/ Doctor Aiguader 88, Barcelona, 08003, Spain
Date: 8-9th October, 2012
9. Course Title: PSIMEx Workshop: Networks and Pathways Bioinformatics for Biologists
Organisers: EMBL-EBI, Hinxton, Nr Cambridge, CB10 1SD, UK (Beneficiary 1)
Date: 8 - 12 July 2013
Venue: EMBL-EBI, Wellcome Trust Genome Campus, Hinxton Cambridge
With the exception of the ‘Small/medium Enterprises Bioinformatics Forum, Barcelona’ (which was purely lecture-based), each of these courses involved teaching 10-40 PhD students and postdocs to use interaction data and network analysis techniques to interrogate large datasets such as the output from GWAS, micro-array, RNA-Seq, and proteomics studies. The courses were a mixture of lectures and hands-on practical sessions, with attendees given a proteomics dataset on which to work through the various analytical techniques taught on the course. Network data was increasingly supplied by IMEx as the PSIMEx grant progressed and more data became available. Course feedback was collected and used to influence and improve subsequent courses; for example, sessions on computational access to data, and a chance for students to work on their own datasets with tutors present, were added to the 3-4 day courses as a result. Participant feedback was systematically captured using SurveyMonkey, with the students being given the opportunity to add free text comments.
Fig. 8 A session at the 2013 Interactions and Pathways course held at EMBL-EBI
In addition to the above training courses, an Interactions and Pathways module is offered as a standard option when institutions request an EMBL-EBI workshop to be hosted at their own institution. The Interaction and Pathways module includes an introduction to data standards and data sources, in which the use of the IMEx dataset is highlighted. Although PSIMEx has not subsidised these workshops (many were FP7 SLING 226073-funded), the role of PSIMEx in funding the underlying data has been emphasized at each of these events and in most cases the person delivering the Interactions and Pathways module was a PSIMEx-funded staff member. A total of 28 roadshows included the Interactions and Pathways module in the period of the PSIMEx grant, in Europe, North America and Australia.
Elearning modules have been written as part of the EMBL-EBI trainonline suite of courses (http://www.ebi.ac.uk/training/networks) with the relationship between IntAct and the IMEx Consortium being explained in the IntAct: Molecular Interactions at the EBI module.
Additionally, an instructive video telling users how to navigate the website is available on the IMEx website (www.imexconsortium.org), and two tutorials have been published in conventional journals in the lifetime of this grant:
1. The publication and database deposition of molecular interaction data. Orchard S, Aranda B, Hermjakob H. Current Protocols in Protein Science 2010, Chapter 25:Unit 25.3. PMID:20393973 [27]
As the title suggests, this tutorial concentrates on the publication of IMEx-compliant data using a variety of tools, for example a custom Excel sheet which uses embedded macros to access the PSI-MI controlled vocabulary terms to annotate the data. A tool has been developed which enables rapid upload from this spreadsheet to an IMEx-member database.
2. Analyzing protein-protein interaction networks. Koh GC, Porras P, Aranda B, Hermjakob H, Orchard SE. Journal of Proteome Research 2012, 11:2014-2031. PMID:22385417 [28]
This paper concentrates on data analysis. Using data from the molecular interaction database IntAct (Beneficiary 1), the software platform Cytoscape and its plugins BiNGO and clusterMaker, and taking as a starting point a list of proteins identified in a mass spectrometry-based proteomics experiment, the tutorial demonstrates how to build, visualize, and analyze a protein–protein interaction network.
Exploitation of results
The most visible users of the IMEx dataset are those who have reused the data in their own web-based resources. In addition to the members of the Consortium, who include on their own websites the ability to search additional data held by other IMEx partners, the web service is used by other data resources that wish either to incorporate high-quality interaction data in their analysis pipeline or to make subsets of the IMEx data readily available to the user. An example of the first of these is the COPa Knowledgebase (COPaKB). COPaKB (http://www.heartproteome.org/copa/) is developed under the NHLBI Proteomics Centers Program and has been created to facilitate understanding of novel biological insights from proteomic datasets. COPaKB supports investigators in processing raw proteomic datasets without the need to access high-end instrumentation, and returns a consistently annotated report of protein properties. COPaKB is configured in a modular structure according to organellar origin. Currently, there are ten modules, including the human heart mitochondria, proteasome and total lysate; the murine heart mitochondria, proteasome, cytosol, nuclei and total lysate; Drosophila mitochondria; and C. elegans mitochondria. Users select an interactome view of their proteome module of interest. The protein-protein interactions within this pre-computed map have been retrieved from the IMEx databases. In this map, proteins are clustered according to their Reactome pathways and/or gene ontology annotations. Users can selectively review a protein cluster by putting a checkmark in the relevant box or by highlighting a member of this cluster.
Fig. 9 Proteins involved in Protein Metabolism highlighted on a human proteasome interactome built in COPaKB using IMEx data
When a protein identifier or MS data file(s) is used to query COPaKB, relevant protein(s) in this map are automatically highlighted.
A second example, mentha, archives evidence collected from manually curated IMEx protein-protein interaction databases (http://mentha.uniroma2.it/about.php). The aggregated data forms an interactome for a selected set of model organisms. Having created these interactomes, mentha offers a series of tools to analyse selected proteins in the context of a network of interactions. The IMEx dataset has also been used, combined with additional data, in secondary databases such as STRING [29] and iRefIndex [30].
List of Websites:
www.imexconsortium.org