Safe and Complete Algorithms for Bioinformatics

Project Information

SAFEBIO

Grant agreement ID: 851093

DOI

10.3030/851093

Project closed

EC signature date 6 September 2019

Start date 1 March 2020

End date 28 February 2025

Funded under

EXCELLENT SCIENCE - European Research Council (ERC)

Total cost

€ 1 498 367,00

EU contribution

€ 1 498 367,00

Coordinated by

HELSINGIN YLIOPISTO
Finland

Periodic Reporting for period 4 - SAFEBIO (Safe and Complete Algorithms for Bioinformatics)

Reporting period: 2024-09-01 to 2025-02-28

Many real-world problems are modeled as computational problems, but unfortunately with incomplete data or knowledge. As such, they may admit a large number of solutions, and we have no way of finding the correct one. This issue is sometimes addressed by outputting all solutions, which is infeasible for many practical problems. This project aimed at constructing a general methodology for finding the set of all sub-solutions common to all solutions. We can ultimately trust these to be part of the correct solution. We call this set safe. The ultimate goal is creating automated and efficient ways of reporting all safe sub-solutions of a problem.
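As a minimal illustration of the notion of safety, consider a toy problem that is not taken from the project: every source-to-sink path in a small directed acyclic graph counts as a solution, and an edge is safe if it lies on all of them. The graph `toy` and the function names `all_paths` and `safe_edges` are hypothetical, and the brute-force enumeration is exactly the "output all solutions" strategy that becomes infeasible on realistic inputs.

```python
from typing import Dict, List, Set, Tuple

Graph = Dict[str, List[str]]


def all_paths(graph: Graph, source: str, target: str) -> List[List[str]]:
    """Enumerate every path from source to target (the graph is assumed acyclic)."""
    if source == target:
        return [[source]]
    paths = []
    for nxt in graph.get(source, []):
        for tail in all_paths(graph, nxt, target):
            paths.append([source] + tail)
    return paths


def safe_edges(graph: Graph, source: str, target: str) -> Set[Tuple[str, str]]:
    """Return the edges shared by all source-to-target paths."""
    edge_sets = [set(zip(p, p[1:])) for p in all_paths(graph, source, target)]
    return set.intersection(*edge_sets) if edge_sets else set()


# Hypothetical toy graph: two solutions, s-a-b-d-t and s-a-c-d-t.
toy = {"s": ["a"], "a": ["b", "c"], "b": ["d"], "c": ["d"], "d": ["t"], "t": []}
print(sorted(safe_edges(toy, "s", "t")))  # [('d', 't'), ('s', 'a')]
```

Only the edges before the fork and after the merge are reported, since the data alone cannot decide between the two branches.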

The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has indeed considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this problem has been approached only from an experimental point of view, with no definite answer as to which safe sub-solutions should be reported. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics and in Computer Science in general. This project derived the first safe and complete algorithms for a number of fundamental problems about walks in graphs, path covers, network flows, and dynamic programming. These were applied inside practical tools for genome assembly, protein alignment and pangenome analysis.

All of this is highly relevant at the moment, since HTS is moving from research labs to hospitals, and we need answers that are first of all accurate. Our approach changes the perspective from which we address real-world problems, and could spur a new line of research in Computer Science and Bioinformatics. The grand aim is a mathematical leap in understanding what can be safely reported from the data.

Since the start of the project, we have made substantial progress in both theoretical and practical directions, focusing on developing algorithms that are safe, complete, and efficient. Our work has addressed key challenges in computer science and bioinformatics, leading to significant advancements across multiple research directions.

We created safe and complete algorithms for walks in directed graphs, including linear-time safe and complete algorithms for Eulerian cycles and edge-covering closed walks. These algorithms are pioneering in addressing fundamental reachability problems, and they uncover structural properties of graphs that can benefit both further theoretical research and applications.
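To make the Eulerian case concrete, the sketch below enumerates all Eulerian circuits of a tiny directed multigraph by backtracking and keeps the pairs of consecutive edges that occur in every circuit. It is a brute-force illustration under stated assumptions, not the project's linear-time algorithm: it runs in exponential time, it reports only safe walks of two edges rather than maximal safe walks, and the names `eulerian_circuits`, `safe_edge_pairs` and the graph `loops` are hypothetical.

```python
from typing import List, Set, Tuple

Edge = Tuple[str, str]


def eulerian_circuits(edges: List[Edge]) -> List[List[int]]:
    """List Eulerian circuits as sequences of edge indices, each starting with edge 0.

    Fixing the first edge picks one representative per cyclic rotation.
    """
    if not edges:
        return []
    circuits: List[List[int]] = []
    used = [False] * len(edges)

    def extend(walk: List[int]) -> None:
        if len(walk) == len(edges):
            # Keep the walk only if the last edge returns to the start of edge 0.
            if edges[walk[-1]][1] == edges[walk[0]][0]:
                circuits.append(list(walk))
            return
        tail = edges[walk[-1]][1]
        for i, (u, _) in enumerate(edges):
            if not used[i] and u == tail:
                used[i] = True
                walk.append(i)
                extend(walk)
                walk.pop()
                used[i] = False

    used[0] = True
    extend([0])
    return circuits


def safe_edge_pairs(edges: List[Edge]) -> Set[Tuple[int, int]]:
    """Return index pairs (i, j) such that edge j follows edge i in every circuit."""
    pair_sets = [
        {(c[k], c[(k + 1) % len(c)]) for k in range(len(c))}
        for c in eulerian_circuits(edges)
    ]
    return set.intersection(*pair_sets) if pair_sets else set()


# Hypothetical example: three loops a->x->a attached to the same node a.
loops = [("a", "b"), ("b", "a"), ("a", "c"), ("c", "a"), ("a", "d"), ("d", "a")]
print(sorted(safe_edge_pairs(loops)))  # [(0, 1), (2, 3), (4, 5)]
```

The two circuits traverse the loops in different orders, so only the walks inside each loop are safe; the order in which the loops are visited at node `a` is not determined.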

We next focused on modeling the genome assembly problem, where we introduced novel concepts like "cut paths" and "remainder structure" as tools for obtaining safe and complete algorithms. These algorithms have been integrated into popular genome assembly tools, enhancing their performance and demonstrating improvements in assembly contiguity, especially on metagenomic datasets.

Then, we tackled safe walks concerning path-finding problems with objective functions, such as Minimum Path Cover (MPC) and network flow decomposition. We achieved groundbreaking results, including the first linear-time parameterized algorithm for MPC, and developed the first efficient solutions for (minimum) flow decomposition problems via integer linear programming, significantly advancing the field.
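For readers unfamiliar with flow decomposition, the following sketch shows the basic problem on a toy instance: a flow on a directed acyclic graph is split into weighted source-to-sink paths by repeatedly peeling off one path and subtracting its bottleneck value. This is the textbook peeling procedure, not the project's integer linear programming approach, and it does not guarantee a minimum number of paths. The function name `decompose_flow` and the example flow are hypothetical, and the input is assumed to satisfy flow conservation with distinct source and sink.

```python
from typing import Dict, List, Tuple

Flow = Dict[Tuple[str, str], int]


def decompose_flow(flow: Flow, source: str, sink: str) -> List[Tuple[List[str], int]]:
    """Split a conserving flow on a DAG into weighted source-to-sink paths."""
    residual = dict(flow)
    paths: List[Tuple[List[str], int]] = []
    while True:
        # Follow edges that still carry flow, starting from the source.
        path, node = [source], source
        while node != sink:
            nxt = next(
                (v for (u, v), f in residual.items() if u == node and f > 0), None
            )
            if nxt is None:
                break
            path.append(nxt)
            node = nxt
        if node != sink:
            return paths  # no flow left to route
        path_edges = list(zip(path, path[1:]))
        weight = min(residual[e] for e in path_edges)  # bottleneck of the path
        for e in path_edges:
            residual[e] -= weight
        paths.append((path, weight))


# Hypothetical flow of value 3 from s to t.
example = {("s", "a"): 3, ("a", "t"): 1, ("a", "b"): 2, ("b", "t"): 2}
for path, weight in decompose_flow(example, "s", "t"):
    print(weight, "-".join(path))
```

Each peeled path saturates at least one edge, so the loop ends after at most as many rounds as there are edges. Different peeling orders can yield different decompositions, which is precisely why asking what all decompositions have in common is meaningful.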

The next results focused on pangenome graphs. We developed the tool GraphChainer, a read-to-pangenome graph aligner, and more generally explored faster solutions for string-to-graph problems. Our work on safe partial alignments for protein sequences offers promising applications in predicting stable protein structures.

Finally, we established lower bounds for our algorithms, proving their optimality and exploring variants of safety in graph and string matching problems. We also proved fine-grained complexity lower bounds for the problem of finding an occurrence of a string in a labeled graph.

Overall, our project's core methodology of outputting "safe partial solutions" has provided novel insights into algorithmic problems with multiple solutions. This approach has led to deep theoretical results and practical applications, particularly in genome assembly, where integrating safe algorithms improved assembly contiguity and incorporated data features like abundances.

Our project has made substantial progress beyond the state of the art, by introducing methodologies and achieving breakthroughs in the field of graph algorithms and bioinformatics. On the one hand, the various theoretical results show that safety can be a very interesting theoretical question, leading to deep combinatorial structures, which can in turn be exploited to obtain efficient algorithms. On the other hand, our safe and complete algorithms report more and longer safe walks than those currently (heuristically) considered by practical Bioinformatics tools.

More specifically, our progress was along the following lines:

1. Development of safe and complete algorithms: We have pioneered the creation of safe and complete algorithms for fundamental graph problems, such as finding the maximal safe walks for Eulerian cycles and edge-covering closed walks. These algorithms run in optimal linear time. This advancement uncovers rich structural properties of graphs, offering new perspectives for further research and applications in graph theory.

2. Rigorous genome assembly: Our exploration of genome assembly problems has led to the development of optimal safe and complete algorithms based on novel concepts like "cut paths" and "remainder structure." These innovations have been integrated into popular genome assembly tools, demonstrating substantial improvements in assembly contiguity, especially on metagenomic datasets. Our work has shown significant advancements over previous heuristic approaches, providing a mathematically rigorous framework for integrating graph reachability and abundance information into the assembly problem.

3. Breakthroughs in path covers and flow decomposition: We have achieved groundbreaking results in solving the Minimum Path Cover and Minimum Flow Decomposition problems. Our linear-time parameterized algorithm for Minimum Path Cover is the first of its kind, offering efficient solutions motivated by practical applications in pangenomics. Additionally, our novel methodology for speeding up integer linear programs for flow decomposition problems, using safe paths, has resulted in significant computational improvements, with speed-ups of up to 200 times on the hardest instances (Runner-up to the Best Paper Award at the SEA 2024 conference).

4. Efficient bioinformatics programs: We have developed highly efficient tools for constructing graphs from sequencing data, such as GGCAT, which is up to 39 times faster than previous tools. Our contributions have also included solving, in linear time, problems previously conjectured to be NP-hard (eulertigs, Best Paper Award at the WABI 2022 conference), with practical performance that confirms this theoretical time complexity. We also applied the concept of safe partial solutions to predicting stable protein structures, which highlights the versatility and potential of safe and complete algorithms to advance bioinformatics and related fields.

All safe walks (in green) that are part of all Eulerian circuits of the graph on the left.
