Final Report Summary - PROMOS (Probabilistic Models in Pseudo-Euclidean Spaces)
Dr. Schleif's project aimed to provide fast and accurate large-scale sequence search algorithms that use generic non-metric similarity scores. These scores measure similarity and can be collected in large, potentially sparse matrices that specify the relatedness of entities, e.g. sequence pairs. Other sources of such matrices are weighted graphs, e.g. in social networks, or user votings, each using domain-specific measures. Dr. Schleif derived and implemented data-specific probabilistic relational models that use generic non-metric similarity scores and remain efficient for large-scale problems. These models avoid a full comparison of a test sequence against the database, as well as the complete calculation of a matrix of score values, during training and retrieval. The approach is not limited to a specific type of alignment function but permits generic, potentially non-metric similarity measures; it was exemplified on protein sequence data and tested on public sequence databases. A simplified scheme of the approach is shown in Figure 1. A probabilistic framework for relational methods in pseudo-Euclidean spaces was developed. New approximation schemes for relational data were developed to address realistic, very large-scale problems and to make model learning and application feasible. A hierarchical model and retrieval scheme was developed to permit fast retrieval.
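The idea of retrieving without a full database comparison can be illustrated with a minimal two-stage sketch. It is not the project's actual model: the k-mer score, the hand-picked prototypes, and the function names kmer_score, build_index and retrieve are all hypothetical and chosen only for illustration. A query is first compared with a few prototypes and then only with the entries assigned to the best-matching prototype.

```python
from collections import Counter


def kmer_score(a, b, k=2):
    """Toy symmetric similarity (an assumption, not the project's score):
    number of k-mers shared by both sequences, counted with multiplicity."""
    ca = Counter(a[i:i + k] for i in range(len(a) - k + 1))
    cb = Counter(b[i:i + k] for i in range(len(b) - k + 1))
    return sum((ca & cb).values())


def build_index(database, prototypes, score=kmer_score):
    """Assign every database entry to its most similar prototype."""
    index = {p: [] for p in prototypes}
    for seq in database:
        best = max(prototypes, key=lambda p: score(seq, p))
        index[best].append(seq)
    return index


def retrieve(query, index, score=kmer_score):
    """Two-stage search: choose the best prototype, then scan only its bucket.

    Returns the best candidate found and the number of score evaluations.
    """
    best_proto = max(index, key=lambda p: score(query, p))
    candidates = index[best_proto]
    n_comparisons = len(index) + len(candidates)
    best = max(candidates, key=lambda s: score(query, s), default=None)
    return best, n_comparisons


database = ["ACGTACGT", "ACGTTCGT", "GGGGCCCC",
            "GGGCCCCC", "TTTTAAAA", "TTTAAAAA"]
prototypes = ["ACGTACGT", "GGGGCCCC", "TTTTAAAA"]  # hand-picked for the demo
index = build_index(database, prototypes)
hit, n_comparisons = retrieve("GGGGCCCA", index)
```

Such a search is approximate: the best match may lie in a bucket that is never scanned, so the saving in score evaluations is traded against a possible loss in accuracy.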
The new technology was tested on a variety of large-scale protein databases.
Dr. Schleif's methods have already been published in a number of highly ranked publications, and further results are currently being prepared for publication. The results clearly show that search in large sequence databases can be substantially improved without demanding computer resources or a substantial loss in accuracy. Thus, Dr. Schleif's work provides conclusive evidence that probabilistic models in pseudo-Euclidean spaces are valuable methods for approaching large-scale problems with domain-specific, non-metric similarity measures. Additionally, the approach has been shown to be widely applicable as long as the proximity measure used is reasonably expressive and symmetric.
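The symmetry requirement can be made concrete with a standard linear-algebra fact, stated here as background rather than as the project's specific formulation. The symbols S, U, Λ, J, X, p and q are notation introduced for this illustration. A real symmetric similarity matrix S always has a real eigendecomposition, and its negative eigenvalues, which arise when the measure is non-metric, give the embedding a pseudo-Euclidean rather than a Euclidean inner product:

```latex
S = U \Lambda U^{\top}
  = \bigl(U |\Lambda|^{1/2}\bigr)\, J \,\bigl(U |\Lambda|^{1/2}\bigr)^{\top},
\qquad
J = \operatorname{sign}(\Lambda)

% With X = U |\Lambda|^{1/2}, and the p positive eigenvalues ordered
% before the q negative ones, each similarity is an indefinite inner product:
s_{ij} = x_i^{\top} J\, x_j
       = \sum_{k=1}^{p} x_{ik} x_{jk} \;-\; \sum_{k=p+1}^{p+q} x_{ik} x_{jk}
```

If all eigenvalues are non-negative, then q = 0 and the expression reduces to the ordinary Euclidean inner product.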