Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics

Project Information

REGINDEX

Grant agreement ID: 101039208

DOI: 10.3030/101039208

EC signature date: 28 January 2022

Start date: 1 September 2022

End date: 31 August 2027

Funded under: European Research Council (ERC)

Total cost: € 1 385 743,00

EU contribution: € 1 385 743,00

Coordinated by: UNIVERSITA CA' FOSCARI VENEZIA, Italy

Periodic Reporting for period 1 - REGINDEX (Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics)

Reporting period: 2022-09-01 to 2025-02-28

The REGINDEX project sets out to develop new efficient algorithms and data structures for searching large amounts of structured data. This objective is becoming a necessity in several scientific fields, such as bioinformatics, databases, and web engines, which share the fact that the data they need to process is produced at exponentially increasing rates. As an example, consider DNA sequencing. It took scientists 13 years (1990-2003) to complete the first draft of the human genome, at a cost of about 2.7 billion US dollars. By 2006, the cost of sequencing a human genome had dropped to about 14 million US dollars. Today, a next-generation sequencing machine can sequence a complete human genome in less than 30 hours for less than 1000 US dollars. This technological revolution has already led to databases containing hundreds of thousands of human genomes (each consisting of about 3 billion nucleotides, i.e. "DNA letters") and is already transforming medicine, paving the way to personalized (genome-based) treatments. From a computational point of view, however, this revolution poses enormous algorithmic challenges. Just storing 1 million uncompressed human genomes would require about 1000 Terabytes of disk space, let alone pre-processing this data to support fast searches on it (a functionality that is vital for discovering whether a particular DNA mutation has been seen before).

REGINDEX tackles this challenge by extending the concept of “sorting” to structured data. Sorting is a familiar concept that often makes searching much faster: consider, for example, the sorted words in a dictionary. REGINDEX’s main goal is to show that this simple idea can be extended to much more structured data, even when the data is compressed. In more detail, the project focuses on indexing labeled graphs and regular languages for substring search queries. One can think of a labeled graph as a generalization of a simple text. While in a text letters occur consecutively, in a labeled graph one can specify which “jumps” between different portions of the text are allowed (and which are not). For example, a labeled graph can encode a family of related genomes: a particular sub-sequence may be missing in one genome while being present in others. A related concept from theoretical computer science is that of “regular language”: a set of strings (e.g. 1000 human genomes) can be encoded compactly as a set of rules (called “regular expressions”) specifying how to generate any string in the set. REGINDEX’s broad objective is to develop compressed representations for labeled graphs and regular languages that support efficient substring queries, that is, finding out if and where a short query string (e.g. a short DNA sequence) appears as a substring in the indexed set of strings. To achieve this ambitious goal, REGINDEX introduces the novel concept of “co-lexicographic partial order”: a powerful tool that allows sorting (and therefore also compressing and indexing) labeled graphs and regular languages, despite their complex structure. Ultimately, the techniques developed within the REGINDEX project will make it possible to store millions of human genomes in just a few Gigabytes, while at the same time supporting fast substring search queries on the compressed database.
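The dictionary analogy above can be made concrete for the simplest case, a single plain text. The following is a minimal Python sketch under stated assumptions, not the project's index: it sorts the suffixes of a text (a classical suffix array, built naively and without compression) and then answers a substring query by binary search. The function names `build_suffix_array` and `find_occurrences` and the toy DNA string are illustrative choices.

```python
from typing import List


def build_suffix_array(text: str) -> List[int]:
    """Return the start positions of all suffixes, in lexicographic order.

    Naive construction for illustration only; it is far too slow and
    memory-hungry for genome-scale inputs.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text: str, sa: List[int], pattern: str) -> List[int]:
    """Return all positions where `pattern` occurs in `text`.

    Because the suffixes are sorted, those starting with `pattern` form one
    contiguous block of `sa`; two binary searches locate its boundaries.
    """
    m = len(pattern)

    # First suffix whose length-m prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # First suffix whose length-m prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(sa[start:lo])


text = "GATTACATTAG"
sa = build_suffix_array(text)
print(find_occurrences(text, sa, "TTA"))
```

Each query inspects only a logarithmic number of suffixes instead of scanning the whole text, which is the benefit of sorting that the project aims to carry over to labeled graphs and regular languages.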

The work performed within the project has been published in two top-level journals (Journal of the ACM and IEEE Transactions on Information Theory) and in twelve other articles in the proceedings of high-quality international conferences (ESA, CPM, SPIRE, SEA, DCC), presented by members of the REGINDEX team at international conferences and workshops. The project is developing a very rich theory of compressed data structures for sorting and indexing automata and regular languages. Our results show how introducing an underlying order in finite-state automata simplifies several computational tasks, such as converting nondeterministic finite-state automata into deterministic ones.

A big part of our work in the first two years of the project has been dedicated to (i) developing fast algorithms for sorting finite automata and (ii) optimizing compressed indexes for regular languages. As far as point (i) is concerned, our biggest achievement to date is a partition-refinement algorithm (published in the proceedings of the European Symposium on Algorithms 2023) that solves this problem on arbitrary NFAs in the case of total orders (quasi-Wheeler orders) in linearithmic time. Prior to this work, we showed (CPM 2023) that within the same time one can sort DFAs with a co-lex order of arbitrary width. More recently, in SPIRE 2024, we showed that co-lex pre-orders can be computed on any NFA in quadratic time. We are now working to reduce this running time to linearithmic or even linear, to develop new data structures exploiting co-lex orders (so far, we have generalized the FM-index and the LCP array to automata; JACM 2023 and CPM 2024), and to use co-lex orders to develop new graph compression tools. As far as point (ii) is concerned, we identified a new, extremely efficient indexing strategy for repetitive collections of strings (finite regular languages): suffixient arrays (work submitted and under review).
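To give a feel for what a total order on the states of an automaton must satisfy, the sketch below checks the classical Wheeler-graph axioms as they are usually stated in the literature. This is an assumption-laden illustration: it only verifies a given order by brute force, it is not the project's partition-refinement algorithm, and the report's quasi-Wheeler and partial co-lex orders are more general than what is checked here. The name `is_wheeler_order` and the toy automaton are hypothetical.

```python
from typing import List, Tuple

Edge = Tuple[int, int, str]  # (source state, target state, label)


def is_wheeler_order(edges: List[Edge], order: List[int]) -> bool:
    """Check whether `order` (a list of all states) is a Wheeler order.

    Brute-force check over all pairs of edges, for illustration only.
    """
    rank = {state: position for position, state in enumerate(order)}
    has_incoming = {target for _, target, _ in edges}

    # Axiom 1: states without incoming edges precede all other states.
    sources = [rank[s] for s in order if s not in has_incoming]
    others = [rank[s] for s in order if s in has_incoming]
    if sources and others and max(sources) > min(others):
        return False

    for u1, v1, a1 in edges:
        for u2, v2, a2 in edges:
            # Axiom 2: a smaller label leads to a strictly smaller target.
            if a1 < a2 and rank[v1] >= rank[v2]:
                return False
            # Axiom 3: with equal labels, targets follow the source order.
            if a1 == a2 and rank[u1] < rank[u2] and rank[v1] > rank[v2]:
                return False
    return True


# Toy automaton: state 0 is the source; states 1..4 are reached by
# the strings "a", "b", "aa", "ba" respectively.
toy_edges = [(0, 1, "a"), (0, 2, "b"), (1, 3, "a"), (2, 4, "a")]

# States sorted by comparing their incoming strings from right to left.
print(is_wheeler_order(toy_edges, [0, 1, 3, 4, 2]))
# Numeric state order, which ignores the labels.
print(is_wheeler_order(toy_edges, [0, 1, 2, 3, 4]))
```

In the toy automaton the first order ranks states by reading their incoming strings from right to left ("", "a", "aa", "ba", "b"), which is the co-lexicographic comparison that gives the orders in the report their name.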

The project so far has produced two main results of potential impact:

1) Our theory of co-lex orders (introduced in our JACM'23 publication and further explored and improved in 13 further research articles) merges three distant research fields: stringology (i.e. string processing), graph theory and regular language theory. For the first time, our theory makes it possible to apply techniques from one field to the others. Examples of this cross-fertilization between fields include:

- Our partition-refinement algorithm (published in the proceedings of the European Symposium on Algorithms 2023) is derived from techniques belonging to graph theory (bisimulation) and algorithmic language theory (DFA minimization). When applied to strings, it yields a brand new suffix sorting algorithm that had not been discovered before, despite the intense research on suffix sorting carried out by the community over the past 50 years.

- Conversely, our co-lex orders (JACM 2023) allow us to apply the theory of suffix sorting (stringology) to graph theory and language theory, yielding powerful results such as a new parameterization of the powerset construction algorithm for determinizing NFAs.
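For readers unfamiliar with the powerset construction mentioned in the last example, here is a minimal Python sketch of the textbook algorithm, restricted to subsets reachable from the start states. It does not implement the project's parameterized analysis; the function name `determinize`, the dictionary encoding of transitions, and the example NFA are assumptions made for illustration.

```python
from collections import deque
from typing import Dict, FrozenSet, Iterable, Set, Tuple

NfaDelta = Dict[Tuple[int, str], Set[int]]
DfaDelta = Dict[Tuple[FrozenSet[int], str], FrozenSet[int]]


def determinize(
    start: Iterable[int], delta: NfaDelta, alphabet: Iterable[str]
) -> Tuple[Set[FrozenSet[int]], DfaDelta]:
    """Textbook powerset construction over reachable subsets only.

    Each DFA state is the set of NFA states reachable by the same string.
    """
    alphabet = list(alphabet)
    start_set = frozenset(start)
    dfa_states = {start_set}
    dfa_delta: DfaDelta = {}
    queue = deque([start_set])

    while queue:
        subset = queue.popleft()
        for symbol in alphabet:
            # Union of the NFA moves from every state in the subset.
            target = frozenset(
                t for s in subset for t in delta.get((s, symbol), ())
            )
            if not target:
                continue  # no move on this symbol; omit the dead state
            dfa_delta[(subset, symbol)] = target
            if target not in dfa_states:
                dfa_states.add(target)
                queue.append(target)

    return dfa_states, dfa_delta


# Example NFA over {a, b}: from state 0, guess that the current 'a'
# is the second-to-last symbol of the input.
nfa_delta: NfaDelta = {
    (0, "a"): {0, 1},
    (0, "b"): {0},
    (1, "a"): {2},
    (1, "b"): {2},
}
states, transitions = determinize({0}, nfa_delta, "ab")
print(len(states), "reachable DFA states")
```

In the worst case the number of reachable subsets grows exponentially with the number of NFA states, which is why a finer parameterization of this construction is of interest.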

Future research will be devoted to speeding up our graph sorting algorithms in order to make them applicable in practical scenarios such as computational pangenomics.

2) A completely novel suffix array compression technique: suffixient arrays (work under review). This new compressed data structure for finite regular languages (collections of strings) solves a major open problem, namely I/O-efficient pattern matching in compressed space. Our new index is smaller and two orders of magnitude faster than the state of the art for the problem (the r-index). We expect this finding to have a big impact in the bioinformatics community, and we are working to extend it to more general regular languages (by generalizing it to co-lex orders).
