Scientists typically retrieve sequence data through database similarity searches. However, public databases such as GenBank and UniProt/SwissProt contain several hundred thousand sequences, and existing bioinformatics techniques do not achieve good retrieval quality at this scale. The PROMOS (Probabilistic models in pseudo-Euclidean spaces) team addressed this gap in bioinformatics approaches. Their goal was to devise algorithms that rapidly return accurate sequence data from large-scale databases. As a first step, the researchers used generic non-metric similarity scores to derive and implement data-specific probabilistic relational models. They successfully developed a probabilistic framework for relational methods in pseudo-Euclidean spaces. To improve model learning and enable fast data retrieval, they developed approximation schemes for relational data as well as a hierarchical model and retrieval scheme. This domain-specific approach is effective because it converts large-scale dissimilarity matrices into approximated positive semi-definite kernel matrices at linear cost. PROMOS technology was tested on several large-scale protein databases, where it showed better run-time performance than classical retrieval systems while keeping model accuracy competitive. The methods have appeared in numerous highly ranked publications, with several more in preparation. Project activities and outcomes should considerably speed up research and development in the biotechnology and pharma sectors.
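
To make the conversion of dissimilarities into a positive semi-definite kernel more concrete, the following is a minimal sketch and not the PROMOS implementation. The summary does not say which approximation scheme the project used; the sketch assumes a Nyström-style low-rank approximation with eigenvalue clipping, a well-known technique with this cost profile. The function name nystroem_psd_factor and its parameters are hypothetical. It only needs the squared dissimilarities between all n items and m landmark items, so memory and run time grow linearly in n for a fixed m.

```python
import numpy as np


def nystroem_psd_factor(D_nm, landmarks, ref=0, tol=1e-10):
    """Low-rank PSD kernel approximation from a block of dissimilarities.

    Hypothetical helper for illustration (not the PROMOS code).

    D_nm      : (n, m) squared dissimilarities between all n items and
                the m landmark items (column j belongs to landmarks[j]).
    landmarks : length-m indices of the landmark items among the n items.
    ref       : position (0..m-1) of the landmark used as the origin.
    Returns F of shape (n, r) such that K_approx = F @ F.T.
    """
    D_nm = np.asarray(D_nm, dtype=float)
    landmarks = np.asarray(landmarks)

    # Squared dissimilarity of every item, and of every landmark,
    # to the reference landmark.
    d_ref_all = D_nm[:, ref]
    d_ref_lm = d_ref_all[landmarks]

    # Dissimilarity -> similarity via the law of cosines with the
    # reference item as origin: s_ij = (d_i0^2 + d_j0^2 - d_ij^2) / 2.
    C = 0.5 * (d_ref_all[:, None] + d_ref_lm[None, :] - D_nm)

    # Small landmark-by-landmark block; symmetrise against round-off.
    W = C[landmarks, :]
    W = 0.5 * (W + W.T)

    # Eigenvalue clipping: keep only clearly positive eigenvalues so the
    # result is PSD even if the input is non-metric (indefinite).
    evals, evecs = np.linalg.eigh(W)
    keep = evals > tol * np.abs(evals).max()

    # K_approx = C @ pinv(W_plus) @ C.T, returned in factored form.
    return C @ (evecs[:, keep] / np.sqrt(evals[keep]))
```

Returning the factor F instead of the full n-by-n matrix is what keeps the cost linear in n: a kernel entry is the dot product of two rows of F, and the full matrix never has to be formed. For non-Euclidean input, clipping discards the negative part of the spectrum, so the result is an approximation rather than an exact reconstruction.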
Sequence data, bioinformatics, PROMOS, probabilistic models, pseudo-Euclidean spaces