
Integrated data and knowledge base of protein structure and sequence


The general objective is to develop an integrated European protein structure database and knowledge base. This will be achieved through the integration of data, algorithmic and knowledge based computational approaches.
Progress has been made in the design of a standard schema of object types and relations describing protein structure and sequence.

ALI, the user interface to SESAM, contains metaknowledge about SESAM. Commands have been developed in ALI to enter this metaknowledge interactively, so that the user can bring ALI up to date with changes made in SESAM. Further improvements have been made to the automatic query building algorithm in ALI. The MENU of ALI has been reorganized into a tree structure that reflects the hierarchic concepts of protein structure, which greatly facilitates the specification of data fields and conditions when accessing SESAM. A set of PLOT and GRAPH commands has been introduced to produce 2-dimensional graphic output (screen or printer) of any pair of numerical data fields extracted from SESAM. Information on the aligned sequences of 28 protein families has been stored in SESAM. An exhaustive repertoire of sequence patterns that are uniquely associated with secondary structures at specific positions, together with their respective territories in all known protein structures, has also been stored in SESAM.

Two algorithms have been developed to extract from the database representative sets of protein chains with maximum coverage and minimum redundancy. A sequence alignment tool has been developed which performs all pairwise comparisons and generates protein families by single linkage cluster analysis. To identify motifs, a method for generating local structural alignments has been developed.
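As an illustration of single linkage cluster analysis over all pairwise comparisons, the following is a minimal sketch and not the project's own tool. The function names, the toy position-by-position identity score and the threshold are assumptions made for the example; a real tool would score pairs with a proper sequence alignment.

```python
from itertools import combinations


def identity(a, b):
    """Toy similarity (assumption): fraction of matching positions,
    relative to the longer sequence. No real alignment is performed."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def single_linkage_families(seqs, threshold, score=identity):
    """Group sequences into families by single linkage.

    seqs maps a name to a sequence string. Two sequences end up in the
    same family if they are connected by a chain of pairs whose score
    is at least `threshold`.
    """
    # Union-find structure: every sequence starts as its own family.
    parent = {name: name for name in seqs}

    def find(x):
        # Follow parent links to the family root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # All pairwise comparisons; merge the families of any linked pair.
    for a, b in combinations(seqs, 2):
        if score(seqs[a], seqs[b]) >= threshold:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    # Collect the members of each family under its root.
    families = {}
    for name in seqs:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())
```

With single linkage, two chains can fall into the same family even when their direct score is below the threshold, as long as an intermediate chain links them. This chaining behaviour is what distinguishes it from stricter clustering criteria.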

A prototype extension to the object-oriented database has been implemented which allows relational database storage routines to be used to store objects. Progress has been made in the addition and validation of protein topology data in the TOPOL database and in the development of user interfaces.
The project is concerned with the development of an integrated protein structure environment and with the development of new computer methods for relating protein sequence, structure and function, by exploiting methods of logic programming, automatic learning procedures, and models of expert problem solving. Data, algorithmic and knowledge based computational approaches will be pooled, cross validated and applied to improve protein structure analysis, prediction methods and computer aided protein design techniques.

Among the more specific goals are the following important developments:

Deriving standard forms for manipulating and exchanging data entities among the project partners, and between software developed by them and commercially available software.

Cross validation of data and software developed by the different partners through direct comparisons, facilitated by the use of a common data scheme and storage environment. This includes the development of data validation software and the testing of data, programs and scientific methodology in the context of structure prediction and modelling.

Shared development of user interfaces (front end), efficient data storage (back end), as well as programs and methods.

Improved integration of advanced methods of database management and AI technology into the field of molecular biology.

Improved knowledge based methods for protein structure prediction.


Avenue F.D. Roosevelt 50
1050 Bruxelles

Participants (4)

Belgian Institute of Management SA/NV
3078 Everberg
Meyerhofstrasse 1
Imperial Cancer Research Fund (ICRF)
United Kingdom
44 Lincoln's Inn Fields
WC2A 3PX London
University College London
United Kingdom
Gower Street
WC1E 6BT London