Periodic Reporting for period 1 - REGINDEX (Compressed Indexes for Regular Languages with Applications to Computational Pan-genomics)
Reporting period: 2022-09-01 to 2025-02-28
REGINDEX tackles this challenge by extending the concept of “sorting” to structured data. Sorting is a familiar idea that often makes searching much faster: consider, for example, the sorted words in a dictionary. REGINDEX’s main goal is to show that this simple idea can be extended to much more structured data, even when the data is compressed. More specifically, the project focuses on indexing labeled graphs and regular languages for substring search queries. One can think of a labeled graph as a generalization of a plain text: while the letters of a text occur consecutively, a labeled graph specifies which “jumps” between different portions of the text are allowed (and which are not). For example, a labeled graph can encode a family of related genomes in which a particular sub-sequence is missing from some genomes and present in others. A related concept from theoretical computer science is that of a “regular language”: a set of strings (e.g. 1000 human genomes) can be encoded compactly as a set of rules (called “regular expressions”) specifying how to generate any string in the set. REGINDEX’s broad objective is to develop compressed representations of labeled graphs and regular languages that support efficient substring queries, i.e. finding out whether and where a short query string (e.g. a short DNA sequence) appears as a substring in the indexed set of strings. To achieve this ambitious goal, REGINDEX introduces the novel concept of “co-lexicographic partial order”: a powerful tool that allows sorting (and therefore also compressing and indexing) labeled graphs and regular languages, despite their complex structure. Ultimately, the techniques developed within the REGINDEX project will make it possible to store millions of human genomes in just a few gigabytes, while still supporting fast substring search queries on the compressed database.
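The dictionary analogy can be made concrete on plain text. The sketch below is a minimal illustration and not the project's index: it sorts all suffixes of an uncompressed string (a naively built suffix array) and answers a substring query by binary search over that sorted list. The function names `build_suffix_array` and `find_occurrences` and the toy DNA string are assumptions made for this example.

```python
def build_suffix_array(text):
    """Return the starting positions of all suffixes of `text`, sorted
    lexicographically. Naive construction, fine for a toy example only."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, suffix_array, pattern):
    """Return the sorted starting positions where `pattern` occurs in `text`.

    All suffixes that start with `pattern` are contiguous in the sorted
    order, so two binary searches are enough to delimit them."""
    m = len(pattern)

    def prefix(rank):
        # First m characters of the suffix with the given rank.
        start = suffix_array[rank]
        return text[start:start + m]

    # Leftmost rank whose m-character prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # Leftmost rank whose m-character prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid

    return sorted(suffix_array[first:lo])


# Toy usage (hypothetical data): look for a short DNA fragment.
genome = "GATTACAGATTA"
index = build_suffix_array(genome)
print(find_occurrences(genome, index, "ATTA"))  # [1, 8]
```

Each query inspects only a logarithmic number of suffixes instead of scanning the whole text, which is the benefit of sorting that the project aims to carry over to labeled graphs and regular languages.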
A large part of our work in the first two years of the project has been dedicated to (i) developing fast algorithms for sorting finite automata and (ii) optimizing compressed indexes for regular languages. Regarding point (i), our biggest achievement to date is a partition-refinement algorithm (published in the proceedings of the European Symposium on Algorithms 2023) that solves this problem on arbitrary NFAs in the case of total orders (quasi-Wheeler orders) in linearithmic time. Prior to this work, we showed (CPM 2023) that DFAs with a co-lex order of arbitrary width can be sorted within the same time bound. More recently (SPIRE 2024), we showed that co-lex pre-orders can be computed on any NFA in quadratic time. We are now working to reduce this running time to linearithmic or even linear, to develop new data structures exploiting co-lex orders (so far, we have generalized the FM-index and the LCP array to automata: JACM 2023 and CPM 2024), and to use co-lex orders to develop new graph compression tools. Regarding point (ii), we identified a new, extremely efficient indexing strategy for repetitive collections of strings (finite regular languages): suffixient arrays (work submitted and under review).
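To give a feel for what sorting the states of an automaton means, the sketch below handles only the simplest case, a trie. In a trie each state is reached by exactly one string, so states can be ordered by comparing those strings from their last letter backwards (co-lexicographically). This brute-force toy is an illustration under that assumption and is not any of the algorithms cited above; `colex_sorted_trie_states` is a hypothetical name, and general NFAs require the partial orders and pre-orders described in the text.

```python
def colex_sorted_trie_states(strings):
    """Return the states of the trie of `strings` in co-lexicographic order.

    Each trie state is identified with the unique prefix that reaches it
    (the root is the empty string). Co-lexicographic order compares strings
    right to left, which is the same as sorting by the reversed string."""
    states = {""}  # the root
    for s in strings:
        for end in range(1, len(s) + 1):
            states.add(s[:end])
    return sorted(states, key=lambda prefix: prefix[::-1])


# Toy usage (hypothetical data): three short strings sharing some letters.
for state in colex_sorted_trie_states(["AC", "GC", "AT"]):
    print(repr(state))
```

In the resulting order, states whose incoming letter is the same (here "AC" and "GC") end up next to each other, which is the kind of regularity that compressed indexes rely on.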
1) Our theory of co-lex orders (introduced in our JACM'23 publication and further explored and improved in 13 further research articles) merges three distant research fields: stringology (i.e. string processing), graph theory and regular language theory. For the first time, our theory makes it possible to apply techniques from one field to the others. Examples of this cross-fertilization between fields include (but are not limited to):
- Our partition-refinement algorithm (published in the proceedings of the European Symposium on Algorithms 2023) is derived from techniques belonging to graph theory (bisimulation) and algorithmic language theory (DFA minimization). When applied to strings, it yields a new suffix sorting algorithm that had not previously appeared in the literature (despite the intense research on suffix sorting carried out by the community over the past 50 years).
- Conversely, our co-lex orders (JACM 2023) allow us to apply the theory of suffix sorting (stringology) to graph theory and language theory, yielding powerful results such as a new parameterization of the powerset construction algorithm for determinizing NFAs.
Future research will be devoted to speeding up our graph sorting algorithms in order to make them applicable in practical scenarios such as computational pangenomics.
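The idea of sorting suffixes by refining a partition can be illustrated with a classic textbook technique, prefix doubling. The sketch below is explicitly not the project's ESA 2023 algorithm: it is a well-known, simpler method used here only to show how positions that start out grouped by their first letter are split into finer classes until every suffix has its own rank. The function name `suffix_array_by_refinement` is an assumption made for this example.

```python
def suffix_array_by_refinement(text):
    """Sort the suffixes of `text` by repeatedly refining a partition of
    its positions (prefix doubling).

    After the round with step k, two positions share a rank exactly when
    their suffixes agree on the first 2k characters."""
    n = len(text)
    if n == 0:
        return []

    # Initial partition: positions are grouped by their first character.
    rank = [ord(c) for c in text]
    k = 1
    while True:
        def key(i):
            # Class of the first k characters, then class of the next k;
            # -1 marks a suffix that ends early and therefore sorts first.
            return (rank[i], rank[i + k] if i + k < n else -1)

        order = sorted(range(n), key=key)

        # Refine: consecutive positions keep the same rank only if their
        # keys are identical.
        new_rank = [0] * n
        for prev, cur in zip(order, order[1:]):
            new_rank[cur] = new_rank[prev] + (key(prev) != key(cur))
        rank = new_rank

        if rank[order[-1]] == n - 1:  # every class is a singleton
            return order
        k *= 2


print(suffix_array_by_refinement("banana"))  # [5, 3, 1, 0, 4, 2]
```

The number of rounds is logarithmic in the text length because the compared prefix length doubles each time; this simple version re-sorts in every round and is meant for readability rather than speed.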
2) A completely novel suffix array compression technique: suffixient arrays (work under review). This new compressed data structure for finite regular languages (collections of strings) solves a major open problem, namely I/O-efficient pattern matching in compressed space. Our new index is smaller than the state of the art for the problem (the r-index) and two orders of magnitude faster. We expect this finding to have a big impact on the bioinformatics community, and we are working to extend the result to more general regular languages (by generalizing it to co-lex orders).