CORDIS - EU research results

Modern Pattern Matching

Periodic Reporting for period 5 - MPM (Modern Pattern Matching)

Reporting period: 2024-01-01 to 2025-08-31

The Modern Pattern Matching (MPM) project was launched to address a fundamental gap between classical algorithmic theory and the demands of the "Big Data" era. Historically, string processing algorithms—the engines behind searching and data analysis—were designed for the Random Access Memory (RAM) model, which assumes that all data can be stored and accessed multiple times. In the modern world, the massive volume of data passing through the internet and generated by scientific sensors makes this assumption obsolete. The core problem addressed by MPM is how to identify complex patterns and calculate distances between strings when the data can only be scanned once (the one-pass model) and memory is extremely limited (the streaming model).



This research is vital for society because the underlying technologies for digital security, real-time healthcare, and global communications rely on efficient string processing. For instance, Network Intrusion Detection Systems (NIDS) must identify malicious signatures in high-speed data streams to prevent cyberattacks, and genomic researchers require tools to analyze DNA sequences in real-time as they emerge from sequencing hardware.



The overall objectives were to modernize the field of stringology by integrating advanced techniques such as sparse recovery and sketching—tools typically used in signal processing—to solve long-standing challenges in approximate pattern matching and dictionary searching. By the conclusion of the project, we successfully established a new library of algorithms that operate with poly-logarithmic space, providing the first such solutions for measuring Hamming and Edit distances in data streams. These results provide the theoretical foundation for next-generation systems capable of processing global-scale data in real-time.
The project achieved several landmark breakthroughs in algorithmic efficiency and theoretical understanding. We developed a comprehensive suite of streaming algorithms, including new results for the k-mismatch problem and near-optimal methods for approximating the Hamming distance in the streaming model. A primary technical highlight was resolving a decade-old open question by developing the first poly-logarithmic space streaming algorithm for pattern matching under edit distance.
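For readers unfamiliar with the k-mismatch problem, the following Python sketch states it as a naive offline baseline: report every alignment of the pattern in the text whose Hamming distance is at most k. It is not one of the project's streaming algorithms; it stores the whole text and takes time proportional to the text length times the pattern length. The function name `k_mismatch_positions` is a hypothetical choice made for this illustration.

```python
def k_mismatch_positions(text: str, pattern: str, k: int) -> list[tuple[int, int]]:
    """Return (start, distance) for every alignment with Hamming distance <= k.

    Naive baseline for illustration only: it keeps the full text in memory,
    which is exactly what a streaming algorithm must avoid.
    """
    m = len(pattern)
    hits = []
    for start in range(len(text) - m + 1):
        mismatches = 0
        for offset in range(m):
            if text[start + offset] != pattern[offset]:
                mismatches += 1
                if mismatches > k:
                    break  # this alignment already exceeds the budget
        if mismatches <= k:
            hits.append((start, mismatches))
    return hits


if __name__ == "__main__":
    print(k_mismatch_positions("abcabd", "abd", 1))
```

A streaming solution has to produce the same answers while reading the text once and keeping only a small summary of what it has seen.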

Furthermore, the project established a wide range of conditional lower bounds that define the computational limits of the field. These include proving the optimality of dynamic LZ77 factorization under the Strong Exponential Time Hypothesis (SETH), showing that update times cannot be improved beyond O(n^(2/3)). We established hardness results for online dictionary matching with one gap based on the 3SUM conjecture and explored the complexity of set disjointness and intersection with bounded universes. We also provided a breakthrough disproof of the Strong 3SUM-INDEXING Conjecture and established optimal time-space tradeoffs for Color Distance Oracles (CDO) and the "snippets" problem under the APSP hypothesis.

The research also yielded significant results for the biological community, including "GreedyMini," a novel method for generating low-density DNA minimizers, and "FiSSC" (Finding Smallest Sequence Covers), which addresses RNA editing by covering sets of degenerate reads.
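As background on what "low-density minimizers" means, the sketch below computes the standard window-minimizer selection over a DNA string and the fraction of k-mers it selects (the density). It assumes plain lexicographic k-mer order with leftmost tie-breaking, which is only the textbook baseline; it does not reproduce GreedyMini, whose contribution is choosing a better order. The names `minimizer_positions` and `density` are hypothetical.

```python
def minimizer_positions(seq: str, k: int, w: int) -> list[int]:
    """Start positions of k-mers chosen as the minimum of some window of w consecutive k-mers.

    Assumption: k-mers are ranked lexicographically and ties go to the leftmost k-mer.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = range(start, start + w)
        # Pick the smallest k-mer in the window; the index breaks ties to the left.
        selected.add(min(window, key=lambda i: (kmers[i], i)))
    return sorted(selected)


def density(seq: str, k: int, w: int) -> float:
    """Fraction of all k-mers in seq that are selected as minimizers."""
    total_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w)) / total_kmers


if __name__ == "__main__":
    print(minimizer_positions("ACGTACGT", k=2, w=3))
    print(density("ACGTACGT", k=2, w=3))
```

A lower density means fewer sampled k-mers for the same window guarantee, which is why the choice of k-mer order matters in practice.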

Our dissemination efforts included a "Holiday School" and a program on the "Theory of Data Science and Deep Learning". We hosted prominent visitors, including Tatiana A. Starikovskaya, Edo Liberty, Przemek Uznański, Konstantin Makarychev, Jelani Nelson, Tal Wagner, David Woodruff, Omri Weinstein, Jeremy Fineman, David Harris, and Cliff Stein. Results were presented at top-tier conferences such as STOC, FOCS, and SODA.
MPM has significantly advanced the state of the art by shifting the focus of stringology from static computation to dynamic, stream-oriented architectures. Before this project, many approximate matching problems were thought to be intractable under strict memory constraints. By adapting sparse recovery—a methodology from compressed sensing—we proved that it is possible to reconstruct errors and mismatches in a text stream using only a tiny "sketch" of the data.
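The idea of recovering mismatches from a small sketch can be illustrated in its simplest form, 1-sparse recovery. The Python sketch below reads two equal-length streams once, keeps three integers, and recovers the position of a mismatch under the assumption that exactly one exists. It is a textbook simplification, not the project's construction; practical schemes handle up to k mismatches and add a randomized fingerprint to detect when the sparsity assumption fails. The class name `OneMismatchSketch` is hypothetical.

```python
class OneMismatchSketch:
    """Three-integer summary of two equal-length symbol streams (illustrative only)."""

    def __init__(self) -> None:
        self.index = 0         # number of symbol pairs seen so far
        self.diff_sum = 0      # sum of (a_i - b_i)
        self.weighted_sum = 0  # sum of i * (a_i - b_i), with i counted from 1

    def update(self, a: str, b: str) -> None:
        """Consume the next symbol of each stream; nothing else is stored."""
        self.index += 1
        d = ord(a) - ord(b)
        self.diff_sum += d
        self.weighted_sum += self.index * d

    def recover(self):
        """Return the 0-based mismatch position, assuming exactly one mismatch.

        Returns None when the sums are inconsistent with a single mismatch.
        With several mismatches the answer is not reliable.
        """
        if self.diff_sum == 0 or self.weighted_sum % self.diff_sum != 0:
            return None
        return self.weighted_sum // self.diff_sum - 1


if __name__ == "__main__":
    sketch = OneMismatchSketch()
    for a, b in zip("GATTACA", "GATCACA"):
        sketch.update(a, b)
    print(sketch.recover())
```

With a single mismatch of value d at position p, the two sums equal d and p times d, so their ratio reveals p without storing either stream.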



Beyond creating new algorithms, the project established a robust framework for "conditional lower bounds." This allowed us to map the complexity landscape of string problems, providing mathematical evidence for why certain tasks have inherent performance barriers. We successfully disproved existing conjectures and provided the first optimal tradeoffs for structural problems like the snippets problem.



The expected long-term impact of this project is a paradigm shift in handling high-velocity data. The tools developed for streaming edit distance, DNA minimizers, and sequence covers are now being integrated into biological sequence analysis and network security protocols. The legacy of MPM is a modernized toolkit that allows scientists and engineers to perform sophisticated pattern recognition on data scales that were previously considered impossible to process in real-time.