Periodic Reporting for period 4 - SAFEBIO (Safe and Complete Algorithms for Bioinformatics)
Reporting period: 2024-09-01 to 2025-02-28
The main motivation of this project comes from Bioinformatics, in particular from the analysis of high-throughput sequencing (HTS) of DNA. One of the main applications of HTS data is to assemble it back into the original DNA sequence. This genome assembly problem admits many solutions, and current research has therefore considered outputting only partial solutions that are likely to be present in the correct original DNA sequence. However, this problem has been approached only from an experimental point of view, with no definitive answer as to which safe sub-solutions should be reported. In fact, the issue of safe sub-solutions has been mostly overlooked in Bioinformatics and in Computer Science in general. This project derived the first safe and complete algorithms for a number of fundamental problems about walks in graphs, path covers, network flows, and dynamic programming. These were applied inside practical tools for genome assembly, protein alignment, and pangenome analysis.
All of this is highly relevant now, as HTS moves from research labs to hospitals, where answers must first of all be accurate. Our approach changes the perspective from which real-world problems are addressed, and could spur a new line of research in Computer Science and Bioinformatics. The grand aim is a mathematical leap in understanding what can be safely reported from the data.
We created safe and complete algorithms for walks in directed graphs, including linear-time safe and complete algorithms for Eulerian cycles and edge-covering closed walks. These algorithms are pioneering in addressing fundamental reachability problems, and they uncover structural properties of graphs that can benefit both further theoretical research and applications.
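To make the notion of safety concrete, the following is a minimal brute-force sketch, not the project's linear-time algorithm. It assumes a tiny directed graph given as a list of edges, enumerates every Eulerian circuit, and calls a walk safe when it appears as a contiguous (cyclic) subwalk of all of them. The function names `eulerian_circuits` and `is_safe` and the example graph are illustrative assumptions.

```python
def eulerian_circuits(edges):
    """Yield every Eulerian circuit as a list of edge indices.

    Each circuit is reported once, rotated so that it starts with edge 0.
    Exponential-time backtracking: only meant for very small graphs.
    """
    out = {}
    for i, (u, _) in enumerate(edges):
        out.setdefault(u, []).append(i)
    start = edges[0][0]
    used = [False] * len(edges)
    used[0] = True
    path = [0]

    def extend():
        if len(path) == len(edges):
            # All edges used: a circuit must return to the start node.
            if edges[path[-1]][1] == start:
                yield list(path)
            return
        node = edges[path[-1]][1]
        for i in out.get(node, []):
            if not used[i]:
                used[i] = True
                path.append(i)
                yield from extend()
                path.pop()
                used[i] = False

    yield from extend()


def is_safe(walk, circuits):
    """Return True if `walk` (edge indices) occurs in every circuit, cyclically."""
    k = len(walk)
    for c in circuits:
        doubled = c + c[:k - 1]  # lets a match wrap around the end of the circuit
        if not any(doubled[j:j + k] == walk for j in range(len(c))):
            return False
    return True


# Hypothetical example: three 2-cycles sharing the node "x".
edges = [("x", "a"), ("a", "x"), ("x", "b"), ("b", "x"), ("x", "c"), ("c", "x")]
circuits = list(eulerian_circuits(edges))
```

In this example there are two Eulerian circuits. The walk `[0, 1]` (x to a and back) is safe because both circuits contain it, whereas `[1, 2]` is not, since the order in which the three cycles are traversed differs between the circuits.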
We next focused on modeling the genome assembly problem, where we introduced novel concepts like "cut paths" and "remainder structure" as tools for obtaining safe and complete algorithms. These algorithms have been integrated into popular genome assembly tools, enhancing their performance and demonstrating improvements in assembly contiguity, especially on metagenomic datasets.
Then, we tackled safe walks concerning path-finding problems with objective functions, such as Minimum Path Cover (MPC) and network flow decomposition. We achieved groundbreaking results, including the first linear-time parameterized algorithm for MPC, and developed the first efficient solutions for (minimum) flow decomposition problems via integer linear programming, significantly advancing the field.
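As a rough illustration of what a flow decomposition is, the sketch below peels source-to-sink paths off a flow one at a time. It is a simple greedy baseline, not the integer linear programming approach described above, and the number of paths it returns is not guaranteed to be minimum. It assumes a directed acyclic graph, positive integer flow values that satisfy flow conservation, and distinct source and sink; the function name `decompose_flow` and the example flow are hypothetical.

```python
def decompose_flow(flow, source, sink):
    """Greedily split a flow into weighted source-to-sink paths.

    flow: dict mapping an edge (u, v) to its positive integer flow value.
    Assumes a DAG, flow conservation at inner nodes, and source != sink.
    Returns a list of (weight, path) pairs; the list need not be minimum.
    """
    residual = dict(flow)
    paths = []
    while True:
        # Follow edges that still carry flow, starting from the source.
        path = [source]
        while path[-1] != sink:
            u = path[-1]
            nxt = next(
                (v for (a, v), f in residual.items() if a == u and f > 0),
                None,
            )
            if nxt is None:
                break
            path.append(nxt)
        if path[-1] != sink:
            # No flow is left to route out of the source.
            return paths
        hops = list(zip(path, path[1:]))
        weight = min(residual[e] for e in hops)  # bottleneck of this path
        for e in hops:
            residual[e] -= weight
        paths.append((weight, path))


# Hypothetical example flow of value 5 from "s" to "t".
flow = {("s", "a"): 5, ("a", "b"): 2, ("a", "t"): 3, ("b", "t"): 2}
paths = decompose_flow(flow, "s", "t")

# Adding the path weights back up edge by edge recovers the input flow.
rebuilt = {}
for weight, path in paths:
    for e in zip(path, path[1:]):
        rebuilt[e] = rebuilt.get(e, 0) + weight
```

Each iteration zeroes at least one edge, so the loop stops after at most as many paths as there are edges. Finding a decomposition with the fewest paths is the harder problem that the exact methods address.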
The next results focused on pangenome graphs. We developed the tool GraphChainer, a read-to-pangenome graph aligner, and more generally explored faster solutions for string-to-graph problems. Our work on safe partial alignments for protein sequences offers promising applications in predicting stable protein structures.
Finally, we established lower bounds for our algorithms, proving their optimality and exploring variants of safety in graph and string matching problems. We also proved fine-grained complexity lower bounds for the problem of finding an occurrence of a string in a labeled graph.
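For readers unfamiliar with the string-in-graph problem, here is a minimal sketch of the textbook approach, under the assumption that each node carries a single character and an occurrence is a walk whose node labels spell the pattern. The function name `occurs_in_graph` and the example graph are illustrative. The sketch takes time roughly proportional to the pattern length times the number of edges.

```python
def occurs_in_graph(pattern, labels, edges):
    """Return True if some walk in the graph spells `pattern`.

    labels: dict mapping each node to its one-character label.
    edges:  iterable of directed edges (u, v).
    """
    if not pattern:
        return True
    succ = {}
    for u, v in edges:
        succ.setdefault(u, []).append(v)
    # Nodes at which a walk spelling the prefix read so far can end.
    current = {v for v, ch in labels.items() if ch == pattern[0]}
    for ch in pattern[1:]:
        current = {
            v
            for u in current
            for v in succ.get(u, [])
            if labels[v] == ch
        }
        if not current:
            return False
    return bool(current)


# Hypothetical example graph with four labeled nodes.
labels = {1: "A", 2: "C", 3: "G", 4: "T"}
edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
```

Here `occurs_in_graph("ACGT", labels, edges)` and `occurs_in_graph("AGT", labels, edges)` both hold, while `"CT"` has no occurrence because no edge leads from the node labeled C to the node labeled T.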
Overall, our project's core methodology of outputting "safe partial solutions" has provided novel insights into algorithmic problems with multiple solutions. This approach has led to deep theoretical results and practical applications, particularly in genome assembly, where integrating safe algorithms improved assembly contiguity and incorporated data features like abundances.
More specifically, our progress was along the following lines:
1. Development of safe and complete algorithms: We have pioneered the creation of safe and complete algorithms for fundamental graph problems, such as maximal safe walks in Eulerian cycles and edge-covering closed walks. These algorithms run in optimal linear time. This advancement uncovers rich structural properties of graphs, offering new perspectives for further research and applications in graph theory.
2. Rigorous genome assembly: Our exploration of genome assembly problems has led to the development of optimal safe and complete algorithms based on novel concepts like "cut paths" and "remainder structure." These innovations have been integrated into popular genome assembly tools, demonstrating substantial improvements in assembly contiguity, especially on metagenomic datasets. Our work has shown significant advancements over previous heuristic approaches, providing a mathematically rigorous framework for integrating graph reachability and abundance information into the assembly problem.
3. Breakthroughs in path covers and flow decomposition: We have achieved groundbreaking results in solving the Minimum Path Cover and Minimum Flow Decomposition problems. Our linear-time parameterized algorithm for Minimum Path Cover is the first of its kind, offering efficient solutions motivated by practical applications in pangenomics. Additionally, our novel methodology for speeding up integer linear programs for flow decomposition problems, using safe paths, has resulted in significant computational improvements, with speed-ups of up to 200 times on the hardest instances (runner-up for the Best Paper Award at the SEA 2024 conference).
4. Efficient bioinformatics programs: We have developed highly efficient tools for constructing graphs from sequencing data, such as GGCAT, which is up to 39 times faster than previous tools. Our contributions have also included solving, in linear time, problems previously conjectured to be NP-hard (eulertigs, Best Paper Award at the WABI 2022 conference); their practical performance confirms this theoretical time complexity. We also applied the concept of safe partial solutions to predicting stable protein structures, which highlights the versatility and potential of safe and complete algorithms to advance bioinformatics and related fields.