Continuation of the Services of the EMBL Data Library and Upgrade of the International Protein Sequence Databank


The period of this contract has been one of the most exciting phases in the history of the biological databases. The scale of the task has increased dramatically as a consequence of advances in sequencing methodologies and new partial sequencing strategies. The two partners, MIPS and the EMBL Data Library, have responded to these changes with major technical advances, which users will see as an all-round increase in professionalism.
The Data Library has moved to EMBL's new Outstation, the European Bioinformatics Institute, remodelled its team, and developed its services to exploit modern network access methods such as the World Wide Web, both for data acquisition and for data distribution and query. A WWW server has also been implemented at MIPS.
Both groups have been involved in extensive developments in database methodologies and have worked on enhancing the content of the various databases. In MIPS's case the Sequence Database Definition Language has been used to provide an exhaustive syntactic and semantic definition of the database format, while in EMBL's case prototype definitions have been built using the CORBA set of standards.
Various new fields have been added to the EMBL databases, the most important of which make it possible to build more fine-grained cross-references between the databases. MIPS has also added new information gained from its detailed curatorial work. Particular attention has been given to the careful collection and annotation of information on:
(i) the function(s) of the protein
(ii) post-translational modification(s)
(iii) residues, regions and domains of defined properties including active sites, motifs, domains as defined by homology etc.
(iv) structural information
(v) variations of the sequence representing different experimental results, mistranslations, biological diversity etc.
Scientific progress has necessitated new approaches to the overall organisation and presentation of the information. For the protein sequences, MIPS has now organised more than 60% of the sequences into families of evolutionarily related homologues. It has also developed methods that enable users to browse for homologies, and has investigated data structures that provide rapid retrieval functions for sequence data.
With an ever-increasing number of user sites for biological sequence information, the problem of keeping remote copies of databases in synchrony has required careful attention. Two methods developed during the contract address this issue. A layered design using remote procedure calls (RPCs), developed at MIPS, supports a rigorous description of a synchronisation model, while a system of communicating transaction logs, implemented by EMBL, provides a simpler method of synchronising read-only copies of the database.
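The transaction-log approach can be illustrated with a minimal sketch. This is not the EMBL implementation, whose details are not given here; it only assumes that the master database records each change as a numbered log entry and that a read-only copy replays the entries it has not yet seen. All names (LogEntry, Primary, ReadOnlyReplica, the accession strings) are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class LogEntry:
    seq: int                      # contiguous transaction number, starting at 1
    op: str                       # "put" or "delete"
    accession: str                # record key (hypothetical accession number)
    record: Optional[str] = None  # new record text for "put"


@dataclass
class Primary:
    """Master copy: every change is applied and appended to the log."""
    records: dict = field(default_factory=dict)
    log: list = field(default_factory=list)

    def _append(self, op, accession, record=None):
        self.log.append(LogEntry(len(self.log) + 1, op, accession, record))

    def put(self, accession, record):
        self.records[accession] = record
        self._append("put", accession, record)

    def delete(self, accession):
        self.records.pop(accession, None)
        self._append("delete", accession)

    def entries_after(self, seq):
        # Entry n is stored at index n - 1, so this returns entries seq+1, seq+2, ...
        return self.log[seq:]


@dataclass
class ReadOnlyReplica:
    """Remote copy: changes only by replaying log entries in order."""
    records: dict = field(default_factory=dict)
    last_applied: int = 0

    def apply(self, entries):
        for entry in entries:
            if entry.seq <= self.last_applied:
                continue  # already applied; replaying is harmless
            if entry.seq != self.last_applied + 1:
                raise ValueError("gap in transaction log")
            if entry.op == "put":
                self.records[entry.accession] = entry.record
            elif entry.op == "delete":
                self.records.pop(entry.accession, None)
            else:
                raise ValueError(f"unknown operation: {entry.op}")
            self.last_applied = entry.seq
```

A remote site would periodically call apply(primary.entries_after(replica.last_applied)). In this sketch the replica needs to remember only the number of the last transaction it applied, which is one reason such a scheme is simpler than a general synchronisation model.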
Both groups have carried out these developments while continuing to provide uninterrupted information services to users throughout the world. The Nucleotide Sequence Database and the PIR International Database have been delivered on schedule. The only disappointments have been, on the EMBL side, problems with the SWISS-PROT pipeline that caused late delivery of releases and, on the MIPS side, less progress than had been hoped with object-oriented databases, owing to weaknesses in commercially supplied software.
During the Contract, the PIR International collection has grown to over 90,000 sequences, while the nucleotide collection now exceeds 600 million base pairs, about twenty percent of the size of the human genome. Interestingly, the fraction of the nucleotide data that is human has risen from 20% to over 40% (by base pairs) as the flow of EST data has surged.
MIPS and EMBL continue to work in close cooperation with their global partners in the USA and Japan, exchanging all data and updates via computer networks.
At the 1995 collaborative meeting of the Nucleotide Sequence Databases, technical developments included:
(i) an extension to the accession number format
(ii) an experimental scheme for representing very long sequences
(iii) the introduction of cross-references to external databases at the level of sequence features
(iv) the implementation of a common taxonomy developed by NCBI
(v) simplified procedures for processing and exchanging data from the patent literature
Finally, at the end of the contract, the partners MIPS and the EBI drew up a joint paper outlining many of the issues that the collaboration will address in future.

