Periodic Reporting for period 5 - MPM (Modern Pattern Matching)
Reporting period: 2024-01-01 to 2025-08-31
This research matters to society because the technologies underlying digital security, real-time healthcare, and global communications rely on efficient string processing. For instance, Network Intrusion Detection Systems (NIDS) must identify malicious signatures in high-speed data streams to prevent cyberattacks, and genomic researchers require tools that analyze DNA sequences in real time as they emerge from sequencing hardware.
The overall objective was to modernize the field of stringology by integrating advanced techniques such as sparse recovery and sketching, tools typically used in signal processing, to solve long-standing challenges in approximate pattern matching and dictionary searching. By the conclusion of the project, we had established a new library of algorithms that operate in polylogarithmic space, providing the first such solutions for measuring Hamming and edit distances in data streams. These results provide the theoretical foundation for next-generation systems capable of processing global-scale data in real time.
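To make the streaming setting concrete, the following is a minimal baseline sketch, not one of the project's algorithms. It reports the Hamming distance between a pattern and every window of the same length in a stream by buffering the most recent characters, so its space is linear in the pattern length rather than polylogarithmic. The function name `stream_hamming` is hypothetical and chosen only for illustration.

```python
from collections import deque


def stream_hamming(pattern, stream):
    """Yield (end_position, hamming_distance) for each full window of the stream.

    Baseline only: keeps the last len(pattern) characters in memory,
    i.e. linear space, unlike the polylogarithmic-space algorithms
    the report refers to.
    """
    m = len(pattern)
    window = deque(maxlen=m)  # oldest character is dropped automatically
    for pos, ch in enumerate(stream):
        window.append(ch)
        if len(window) == m:
            # Count mismatching positions between pattern and current window.
            dist = sum(a != b for a, b in zip(pattern, window))
            yield pos, dist


if __name__ == "__main__":
    for end, dist in stream_hamming("abc", "abcabd"):
        print(end, dist)
```

The generator consumes the stream one character at a time, which mirrors the streaming model; only the buffer size separates this baseline from a small-space solution.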
Furthermore, the project established a wide range of conditional lower bounds that define the computational limits of the field. These include proving that dynamic LZ77 factorization is optimal under the Strong Exponential Time Hypothesis (SETH), in the sense that update times cannot be improved beyond O(n^(2/3)). We established hardness results for online dictionary matching with one gap based on the 3SUM conjecture and explored the complexity of set disjointness and intersection with bounded universes. We also provided a breakthrough disproof of the Strong 3SUM-INDEXING Conjecture and established optimal time-space tradeoffs for Color Distance Oracles (CDO) and the "snippets" problem under the APSP hypothesis.
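For readers unfamiliar with the object being maintained, here is a minimal static sketch of an LZ77 factorization. It is a naive, polynomial-time illustration and not the dynamic data structure discussed above. It assumes one common variant of the definition: each phrase is the longest prefix of the remaining text that also occurs starting at an earlier position (overlaps allowed), or a single fresh character. The name `lz77_factorize` is hypothetical.

```python
def lz77_factorize(s):
    """Return the list of LZ77 phrases of s (self-referential variant).

    Naive illustration: for each phrase start i, try every earlier
    start j < i and extend the match as far as possible.
    """
    phrases = []
    i, n = 0, len(s)
    while i < n:
        best = 0
        for j in range(i):  # candidate earlier occurrence
            length = 0
            # j + length < i + length < n, so indexing stays in range.
            while i + length < n and s[j + length] == s[i + length]:
                length += 1
            best = max(best, length)
        step = max(best, 1)  # no earlier occurrence: emit a fresh character
        phrases.append(s[i:i + step])
        i += step
    return phrases


if __name__ == "__main__":
    print(lz77_factorize("abababab"))
```

A dynamic version has to keep this phrase sequence up to date as the text is edited, which is the setting the conditional lower bound addresses.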
The research also yielded significant results for the biological community, including "GreedyMini," a novel method for generating low-density DNA minimizers, and "FiSSC" (Finding Smallest Sequence Covers), which addresses RNA editing by covering sets of degenerate reads.
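As background on what "density" means for minimizers, the sketch below selects, in every window of w consecutive k-mers, the smallest k-mer under a given order, and reports the fraction of k-mer positions selected. It uses plain lexicographic order as an assumed default; it does not reproduce GreedyMini, which constructs orders with lower density. The function names and parameters are hypothetical.

```python
def minimizer_positions(seq, k, w, order=None):
    """Return sorted start positions of k-mers chosen as window minimizers.

    Each window covers w consecutive k-mers; ties are broken by the
    leftmost position. `order` maps a k-mer to a sortable key
    (lexicographic order if omitted).
    """
    if order is None:
        order = lambda kmer: kmer
    n_kmers = len(seq) - k + 1
    selected = set()
    for start in range(n_kmers - w + 1):
        best = min(range(start, start + w),
                   key=lambda i: (order(seq[i:i + k]), i))
        selected.add(best)
    return sorted(selected)


def density(seq, k, w, order=None):
    """Fraction of k-mer positions in seq that are selected as minimizers."""
    n_kmers = len(seq) - k + 1
    return len(minimizer_positions(seq, k, w, order)) / n_kmers


if __name__ == "__main__":
    print(minimizer_positions("ACGTACGT", k=2, w=3))
    print(density("ACGTACGT", k=2, w=3))
```

Passing a different `order` function changes which k-mers are picked and therefore the density, which is the quantity a low-density scheme aims to reduce.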
Our dissemination efforts included a "Holiday School" and a program on the "Theory of Data Science and Deep Learning". We hosted prominent visitors, including Tatiana A. Starikovskaya, Edo Liberty, Przemek Uznański, Konstantin Makarychev, Jelani Nelson, Tal Wagner, David Woodruff, Omri Weinstein, Jeremy Fineman, David Harris, and Cliff Stein. Results were presented at top-tier conferences such as STOC, FOCS, and SODA.
Beyond creating new algorithms, the project established a robust framework for "conditional lower bounds." This allowed us to map the complexity landscape of string problems, providing mathematical evidence for why certain tasks have inherent performance barriers. We successfully disproved existing conjectures and provided the first optimal tradeoffs for structural problems like the snippets problem.
The expected long-term impact of this project is a paradigm shift in handling high-velocity data. The tools developed for streaming edit distance, DNA minimizers, and sequence covers are now being integrated into biological sequence analysis and network security protocols. The legacy of MPM is a modernized toolkit that allows scientists and engineers to perform sophisticated pattern recognition on data scales that were previously considered impossible to process in real time.