Safe and Complete Algorithms for Bioinformatics

Many real-world problems are modeled as computational problems, but often with incomplete data or knowledge. As a result, they may admit a large number of solutions, with no way of telling which one is correct. This issue is sometimes addressed by outputting all solutions, which is infeasible for many practical problems. We aim to construct a general methodology for finding the set of all sub-solutions common to all solutions, since these can be trusted to be part of the correct solution. We call such sub-solutions safe. Ultimately, we aim to create automated and efficient ways of reporting all safe sub-solutions of a problem.
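As a toy illustration of this notion of safety, the sketch below treats every source-to-sink path in a small directed acyclic graph as one possible "solution" and reports the edges shared by all of them. This is only a brute-force illustration under stated assumptions: the graph, the function names all_paths and safe_edges, and the choice of edges as sub-solutions are hypothetical, and enumerating all solutions is exactly what the project's algorithms are meant to avoid.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def all_paths(graph: Dict[str, List[str]], source: str, sink: str) -> List[List[str]]:
    """Enumerate every source-to-sink path in a DAG (exponential in general)."""
    if source == sink:
        return [[sink]]
    paths = []
    for nxt in graph.get(source, []):
        for rest in all_paths(graph, nxt, sink):
            paths.append([source] + rest)
    return paths


def safe_edges(graph: Dict[str, List[str]], source: str, sink: str) -> Set[Edge]:
    """Return the edges that appear in every source-to-sink path."""
    paths = all_paths(graph, source, sink)
    if not paths:
        return set()
    edge_sets = [set(zip(p, p[1:])) for p in paths]
    return set.intersection(*edge_sets)


# Hypothetical toy graph: two alternative routes between a and d.
toy_graph = {"s": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["t"]}
print(safe_edges(toy_graph, "s", "t"))
```

Here the two solutions disagree on whether to pass through b or c, so only the edges (s, a) and (d, t) are safe. They are part of the correct solution whichever path that turns out to be.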

The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has indeed considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this problem has been approached only from an experimental point of view, with no definite answer as to what all the safe sub-solutions to report are. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics and in Computer Science in general. This project will derive the first safe algorithms for a number of fundamental problems about walks in graphs, network flows, and dynamic programming. We will apply these inside practical tools for genome assembly, RNA assembly and pan-genome analysis.

All of this is highly relevant at the moment, because HTS is moving from research labs to hospitals, and we need answers that are, first of all, accurate. Our approach changes the perspective from which we address real-world problems, and could spur a new line of research in Computer Science and Bioinformatics. The grand aim is a mathematical leap in understanding what can be safely reported from the data.

Since the start of the project, we have made significant progress both on the theoretical aspects of the project, namely deriving safe and complete algorithms for a range of central problems in Computer Science and Bioinformatics, and in showing under controlled experimental conditions that these safe algorithms improve over state-of-the-art Bioinformatics methods in terms of accuracy, completeness, or both.

First, in order to derive efficient safe algorithms, we started by improving the state of the art for solving the original versions of some of these problems. For example, in a work published in 2022 in the prestigious algorithms conference SODA, we gave the first linear-time parameterized algorithm for the classical minimum path cover problem. In practical applications where the size of a minimum path cover is small, our algorithm is the first one running in linear time, which is a surprising result on a fundamental problem studied since the 1950s. In a work published in 2022 in the prestigious Algorithmic Bioinformatics conference RECOMB, we gave the first fast and exact solver for the standard minimum flow decomposition problem. Our solver is based on Integer Linear Programming, which offers expressive flexibility and for which efficient, mature solvers exist. This implies that future research can use our techniques to model many aspects of real data, and can implement our framework at the core of an RNA assembly Bioinformatics tool.
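To make the flow decomposition problem concrete, the sketch below splits a flow on a small directed acyclic graph into weighted source-to-sink paths using a simple greedy heuristic. It is not the Integer Linear Programming solver described above and does not guarantee a minimum number of paths. The function name greedy_flow_decomposition and the example flow are hypothetical, and the input is assumed to be a valid flow (conservation holds at every internal node).

```python
from typing import Dict, List, Tuple

Edge = Tuple[str, str]


def greedy_flow_decomposition(
    flow: Dict[Edge, int], source: str, sink: str
) -> List[Tuple[List[str], int]]:
    """Greedily peel weighted source-to-sink paths off a flow on a DAG.

    Assumes flow conservation at every node other than source and sink.
    The result is a valid decomposition, not necessarily a minimum one.
    """
    remaining = dict(flow)
    decomposition = []
    while True:
        # Walk from the source along edges that still carry flow.
        path = [source]
        node = source
        while node != sink:
            nxt = next(
                (v for (u, v), f in remaining.items() if u == node and f > 0),
                None,
            )
            if nxt is None:
                break
            path.append(nxt)
            node = nxt
        if node != sink:
            break  # no flow left to route from the source
        edges = list(zip(path, path[1:]))
        weight = min(remaining[e] for e in edges)  # bottleneck of this path
        for e in edges:
            remaining[e] -= weight
        decomposition.append((path, weight))
    return decomposition


# Hypothetical example flow of total value 7 from s to t.
example_flow = {
    ("s", "a"): 5, ("s", "b"): 2,
    ("a", "t"): 3, ("a", "b"): 2,
    ("b", "t"): 4,
}
for path, weight in greedy_flow_decomposition(example_flow, "s", "t"):
    print(weight, "->".join(path))
```

In an RNA assembly setting, each weighted path would correspond to a candidate transcript and its abundance. Different decompositions of the same flow can disagree, which is why the sub-paths common to all of them are of interest.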

Second, notable safe and complete algorithms for theoretical problems include one published in 2021 in the prestigious ICALP conference, showing that all safe walks for a natural formulation of the genome assembly problem can be computed in time linear in their length. This matches the time required to compute the popular but less complete safe walks currently used in most genome assemblers.

Third, in two works published in 2022 in the prestigious RECOMB and ESA conferences, we also implemented our safe and complete algorithms, showing that computing the safe paths for all flow decompositions in a directed acyclic graph can also be done efficiently, again improving over the length of the subset of safe walks currently considered in the community. All these results show that computing all safe walks for two assembly-related computational biology problems is no longer a computational bottleneck, opening the door to their integration in Bioinformatics tools.

Since safe walks in sequence graphs are also related to the compression of sequencing data, we also studied several related questions. For example, in a work presented in 2022 at the Workshop on Algorithms in Bioinformatics (WABI), we gave an optimal linear-time algorithm for compressing such data, a problem that had recently been proposed as computationally hard (NP-hard). This result received the Best Paper Award at WABI 2022.

On the one hand, the various theoretical results obtained so far show that safety can be a very interesting theoretical question, leading to deep combinatorial structures, which can in turn be exploited to obtain efficient algorithms. On the other hand, our safe and complete algorithms generally report more and longer safe walks than those currently (heuristically) considered by practical Bioinformatics tools.

By the end of the project, we expect to develop several practical Bioinformatics tools related to the assembly problem and its variants mentioned above, using safety at their core. These can potentially exhibit a significant boost either in solution accuracy (by reporting only sub-solutions that are common to all solutions) or in length (by reporting longer safe sub-solutions than the partial set of safe ones currently reported).

Periodic Reporting for period 2 - SAFEBIO (Safe and Complete Algorithms for Bioinformatics)
