## Final Report Summary - PROMOS (Probabilistic Models in Pseudo-Euclidean Spaces)

With projects such as the genome sequencing initiative and its successors, huge amounts of sequence data have become available, and database sequence similarity search (retrieval) has become a central tool in bioinformatics for identifying potentially homologous sequences. Such databases are now widely used as an initial step in sequence characterization and annotation, phylogeny, genomics, transcriptomics, and proteomics studies. Prominent public databases are GenBank with ≈ 135,000 entries and UniProt/SwissProt with ≈ 500,000 sequences. Classically, querying such a database requires comparing the query sequence to each entry using an alignment algorithm such as FASTA, Smith-Waterman, or BLAST. Heuristic algorithms have been designed to reduce the time required to build an alignment, but the search still has linear complexity in the number of database entries, which poses a major challenge as sequence databases keep growing. The search produces a list of similar sequences and local alignments, which then have to be examined carefully. Current approaches lack a principled, formalized model to overcome the insufficient retrieval quality.

The project of Dr. Schleif aimed to provide fast and accurate large-scale sequence search algorithms that use generic non-metric similarity scores. These scores measure similarity and can be collected in large, potentially sparse matrices that specify the relatedness of the entities, e.g. sequence pairs. Alternative sources of such matrices are weighted graphs, e.g. in social networks, or user votings, all of which use domain-specific measures. Dr. Schleif derived and implemented data-specific probabilistic relational models that use generic non-metric similarity scores and remain efficient for large-scale problems. This avoids both a full comparison of a test sequence against the database and the complete calculation of a matrix of score values during training and retrieval. The approach is not limited to a specific type of alignment function but permits generic, potentially non-metric measures of similarity; it was exemplified on protein sequence data and tested on public sequence databases. A simplified scheme of the approach is shown in Figure 1. A probabilistic framework for relational methods in pseudo-Euclidean spaces was developed. New approximation schemes for relational data were developed to approach realistic problems on a very large scale and to make model learning and application feasible. A hierarchical model and retrieval scheme was developed to permit fast retrieval.
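To make the idea of avoiding the full score matrix more concrete, the following is a minimal sketch using a Nyström-style low-rank approximation. This summary does not name the approximation scheme used in the project, so the Nyström technique, the landmark selection, and all function names below (`nystroem_factors`, `approximate_query_scores`, `pseudo_euclidean_signature`) are illustrative assumptions, not the project's implementation. The sketch assumes a symmetric, possibly indefinite (non-metric) similarity matrix of which only the columns belonging to a small set of landmark entries are computed.

```python
import numpy as np


def nystroem_factors(landmark_columns, landmark_idx, rcond=1e-10):
    """Return (C, W_pinv) such that S is approximated by C @ W_pinv @ C.T.

    landmark_columns: (n, m) array holding the similarities of all n
        database entries to the m landmark entries, i.e. S[:, landmark_idx].
    landmark_idx: indices of the m landmarks within the database.
    Assumption: S is symmetric; it may be indefinite (non-metric).
    """
    C = np.asarray(landmark_columns, dtype=float)
    W = C[landmark_idx, :]          # (m, m) landmark-to-landmark block
    W = 0.5 * (W + W.T)             # enforce symmetry against rounding
    # The pseudo-inverse also handles negative eigenvalues of W,
    # which occur for non-metric similarity measures.
    W_pinv = np.linalg.pinv(W, rcond=rcond)
    return C, W_pinv


def approximate_query_scores(C, W_pinv, query_to_landmarks):
    """Estimate the similarity of a query to all n database entries
    from its m similarities to the landmarks only."""
    return C @ (W_pinv @ np.asarray(query_to_landmarks, dtype=float))


def pseudo_euclidean_signature(W, rel_tol=1e-8):
    """Count positive and negative eigenvalues (p, q) of a symmetric
    matrix; q > 0 indicates a pseudo-Euclidean (indefinite) geometry."""
    eigvals = np.linalg.eigvalsh(0.5 * (W + W.T))
    cutoff = rel_tol * np.max(np.abs(eigvals))
    return int(np.sum(eigvals > cutoff)), int(np.sum(eigvals < -cutoff))
```

In this sketch a query needs only m similarity evaluations (against the landmarks) instead of n (against every database entry). The approximation is exact when the landmark block has the same rank as the full matrix and degrades otherwise, so the number and choice of landmarks are the main design decisions.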

The new technology was tested on a variety of large scale protein databases.

Dr. Schleif’s methods have already been published in a number of highly ranked publications, and further results are currently being prepared for publication. The results clearly show that search in large sequence databases can be substantially improved without demanding computing resources or a substantial loss in accuracy. Thus, Dr. Schleif’s work provides conclusive evidence that probabilistic models in pseudo-Euclidean spaces are valuable methods for approaching large-scale problems with domain-specific, non-metric similarity measures. Additionally, the approach has been shown to be widely applicable as long as the proximity measure used is reasonably expressive and symmetric.
