
Modern Pattern Matching

Periodic Reporting for period 2 - MPM (Modern Pattern Matching)

Reporting period: 2019-01-01 to 2020-06-30

This project aims to advance a new wave of progress in Pattern Matching, the algorithmic study of computation on strings. Advances in technology over the last decade, together with the massive amount of data passing through the internet, have intrigued and challenged computer scientists, as the models of computation used before this era are now less relevant, and algorithms designed for them are often too slow.
To this end, modern computational models have been suggested, allowing computer scientists to tackle these technological advances and to develop new algorithmic techniques under the constraints arising from the new technologies. These technological challenges and scenarios include, but are not limited to, the so-called Big Data phenomenon.
In the most basic sense, in these models one is allowed to scan the input only once, and the goal is to compute some function of the input. This is commonly known as the one-pass model. It has a few variations, including, for example, the streaming model, where we are allowed to use only very small space (usually a polylogarithmic number of machine words). The model is helpful, for example, in modeling routers, where one may want to compute statistics on the massive amount of data flowing through, or very large databases, where updates pass through an entry point and one wishes to answer a statistical query without scanning all of the data.
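As an illustration of the one-pass, small-space setting described above, the following is a minimal sketch of the classical Misra-Gries frequent-items summary, a standard textbook streaming technique and not one of the project's algorithms. The function name `misra_gries` and the parameter `k` are chosen here for illustration only. The sketch reads each item exactly once and keeps at most k - 1 counters, regardless of the stream length.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Scan `stream` once, keeping at most k - 1 counters.

    Every item that occurs more than n / k times in a stream of length n
    is guaranteed to be among the returned keys. The returned counts are
    underestimates of the true frequencies.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop the
            # ones that reach zero. The current item is discarded.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


if __name__ == "__main__":
    # 'a' is a strict majority of this stream, so with k = 2 it must survive.
    print(misra_gries("abacabadabaa", k=2))
```

The space used depends only on k, not on the length of the stream, which is the kind of guarantee the streaming model asks for. A second pass (when available) can turn the surviving candidates into exact counts.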

Several subfields of computer science have seen great progress in dealing with the new challenges, including statistical computation, graph theory, numerical linear algebra, and computational geometry, all of which have been working rigorously on solutions for algorithmic challenges in such modern models. This line of work has introduced new algorithmic tools such as sparse recovery. However, although the pattern matching community has had great success in coping with many algorithmic challenges, it has only recently started gaining insight into what can be done in these new models, and it is lagging behind with respect to algorithmic development within the new technological constraints.
After all, it was only recently that we were the first to show that the most basic pattern matching problem, where one wishes to find all occurrences of a given pattern in a streaming text, can be solved in the streaming model using very small space. Although some recent work has addressed these modern models, this work is only in its initial phase, and most of the challenges have not yet been explored.
In particular, there are no algorithms for pattern matching problems that have utilized ideas from sparse recovery.
This delay in algorithmic development is especially surprising given the fundamental nature of string problems in the computing world. The lack of techniques in these models is also apparent in industry: several industrial companies have recently approached us for advice on such challenges.
We emphasize that, in addition to the new models of computation, the RAM model is still very much open for research in the context of modern models. This is especially the case when one is willing to accept an approximate solution in order to reduce time or space cost.

From a lower bound perspective, techniques such as information transfer and other ideas from communication complexity have recently been utilized in proving poly-logarithmic time lower bounds for various problems in these modern models. However, only recently has there been progress in establishing polynomial lower bounds for algorithmic challenges in the theoretical pattern matching community. This is done by reducing problems that are popularly conjectured to be hard to various pattern matching problems, thereby providing evidence that these pattern matching problems are also hard to solve. Such lower bound proofs are known as conditional lower bounds. Indeed, the task of proving unconditional polynomial time lower bounds for algorithmic problems has proven to be beyond the grasp of theoretical computer science, and so conditional lower bounds have been gaining popularity.
However, theoretical computer science, and in particular the theoretical pattern matching field, has not yet established tools for proving polynomial time/space tradeoffs for various algorithmic tasks.
Such lower bounds are of particular interest in modern models, since one can typically either invest a lot of time or pool together many machines to preprocess data in advance, thereby allowing for fast query processing on several end-users' machines, all utilizing the same (small) data structure. In such scenarios, space usage is a more important parameter than preprocessing time.

The general goal of the proposed research is to introduce modern techniques and concepts into the theoretical pattern matching field, thereby tackling the difficulties that have arisen in the modern computational world. We plan on developing efficient novel algorithmic solutions, with provable asymptotic guarantees on performance, for various pattern matching problems in modern models of computation. One of the main algorithmic techniques that we plan on utilizing is sparse recovery, a young and rapidly evolving toolkit that has yet to be used in pattern matching algorithms. We also plan on expanding sketching tools and string combinatorics to suit the modern models.
From a lower bound perspective, we plan on proving matching lower bounds and hardness results using conditional lower bounds, information transfer, and communication complexity techniques. We also plan on introducing a new framework for proving conditional time/space lower bounds, which is a new concept within theoretical computer science.
The project has mainly focused on training the first wave of students and postdocs, and on research on streaming pattern matching, in particular on various distance metrics. In addition, the research group has worked on algorithms in distributed settings, conditional lower bounds, and dynamic graph algorithms.

The main results achieved so far are as follows:
1. New algorithms for approximate Hamming distance computation in many computational models, ranging from the classical offline model to the streaming model.
2. Improved data structures for computing the Longest Common Extension of two indices in a text, leading to improved algorithms for various pattern matching problems.
3. New algorithms in the streaming model for the k-mismatch problem, with improved total runtime.
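To make the three problems in the list above concrete, the following is a minimal sketch of their definitions as naive offline baselines. These are not the project's algorithms: the functions are quadratic-time and store the entire input, whereas the results above concern far more efficient offline and small-space streaming solutions. All function names are chosen here for illustration only.

```python
from typing import List, Optional


def text_to_pattern_hamming(text: str, pattern: str) -> List[int]:
    """Hamming distance between `pattern` and every aligned window of `text`."""
    m = len(pattern)
    return [
        sum(1 for a, b in zip(text[i:i + m], pattern) if a != b)
        for i in range(len(text) - m + 1)
    ]


def k_mismatch(text: str, pattern: str, k: int) -> List[Optional[int]]:
    """Report the distance at each alignment if it is at most k, else None."""
    return [d if d <= k else None for d in text_to_pattern_hamming(text, pattern)]


def lce(text: str, i: int, j: int) -> int:
    """Longest Common Extension: length of the longest common prefix of
    the suffixes of `text` starting at positions i and j."""
    length = 0
    while (
        i + length < len(text)
        and j + length < len(text)
        and text[i + length] == text[j + length]
    ):
        length += 1
    return length


if __name__ == "__main__":
    print(text_to_pattern_hamming("abcabd", "abd"))  # [1, 3, 3, 0]
    print(k_mismatch("abcabd", "abd", k=1))          # [1, None, None, 0]
    print(lce("abcabd", 0, 3))                       # 2
```

An approximate Hamming distance algorithm would return, for each alignment, a value close to the exact distance computed here, in exchange for lower time or space cost.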
The research so far has advanced the state of the art for many algorithmic problems in the streaming pattern matching world.
The following papers are the ones we view as the most significant outcomes of the project to date (note that authors appear in alphabetical order):

- Timothy M. Chan, Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat. Approximating text-to-pattern Hamming distances. STOC 2020: 643-656
- Or Birenzwige, Shay Golan, Ely Porat. Locally Consistent Parsing for Text Indexing in Small Space. SODA 2020: 607-626
- Shay Golan, Tomasz Kociumaka, Tsvi Kopelowitz, Ely Porat, Przemyslaw Uznanski. Improved Circular k-Mismatch Sketches. APPROX/RANDOM 2020: 46:1-46:24