Sublinear Algorithms for Modern Data Analysis

Periodic Reporting for period 4 - SUBLINEAR (Sublinear Algorithms for Modern Data Analysis)

Reporting period: 2022-09-01 to 2023-08-31

The design of efficient algorithms and the mapping of the 'boundary of tractability' have been major goals of computer science since the dawn of the digital era, and polynomial runtime has been the classical notion of efficiency throughout. Over the past few decades, our computational, measurement and storage capabilities have grown exponentially. However, the sizes of datasets to be processed have grown even faster, resulting in the 'big data' phenomenon, which has shifted the 'boundary of tractability'. Indeed, on modern inputs, quadratic and sometimes even linear time algorithms often become prohibitively expensive. This calls for a new class of techniques with sublinear resource requirements. Specifically, processing large datasets requires algorithms that can compute answers in sublinear runtime, operate under tight restrictions on space (streaming algorithms) and communication (sketching algorithms), or even minimize the number of accesses to the input (sample complexity). The goal of this project is to design such techniques for fundamental data processing problems, thereby building a solid theoretical foundation for modern data analysis.

This project is focused on three main directions: sublinear time graph algorithms and graph sketching, understanding the limits of robust graph exploration in small space, and sparse Fourier transform beyond sparsity. The first of these directions amounts to designing space optimal algorithms for processing very large networks (e.g. community detection, clustering, finding matchings). The second direction asks for impossibility results showing that the algorithms developed in the first direction are optimal. Such impossibility results are an integral part of our goal of 'mapping the boundary of tractability', as they show us when the algorithmic results that we have are best possible in our computational models. The last direction asks for very fast algorithms for one of the central tools of data analysis, namely the Fourier transform. Specifically, our goal in this direction is to design techniques for computing the Fourier transform that exploit structural properties of inputs that often occur in practice.

The project has resulted in several top results on central problems in sublinear algorithms, and has opened several exciting new lines of inquiry that I am sure will continue to drive big data algorithms forward for years to come.

We have been able to achieve exciting progress on all major directions set forth in our proposal, with the corresponding papers being disseminated at the top venues for research in theoretical computer science (STOC/FOCS/SODA).

On the algorithmic side, our recent work includes several definitive results on central problems in the area. Some of the highlights include:

1. an essentially optimally fast (linear time in output size) algorithm for computing sparsifiers of graphs that can operate even when the input graph undergoes insertions and deletions, a common model for modern massive dynamic graphs.
The main technique underlying this result is a new way of partitioning graphs based on the effective resistance metric that can be viewed as a 'hashing scheme' for graph data that allows recovery of the 'dominant' edges of the graph. The scheme is based on a combination of dimensionality reduction techniques and locality sensitive hashing, a fundamental (and practical!) method for nearest neighbor search.

2. query optimal algorithms for approximating graph cluster structure in sublinear time using a few random walks. Our first result on this problem introduced new linear algebraic methods for testing clusterability using an asymptotically optimal number of queries. Our more recent work proposes a 'clustering oracle', i.e. a query and runtime efficient method for accessing cluster structure of graphs in sublinear time. The main innovation in this work is the idea of obtaining access to the spectral embedding of the graph using short random walks. Our most recent results in this line of work show how to extract *hierarchical* cluster structure from graphs in sublinear time using random walks.

3. an algorithm for obtaining constant factor approximation to maximum matching size in polylogarithmic space from a sequence of random samples of edges of the graph (obtaining an approximation ratio close to 1 likely requires significantly more space, i.e. this result is likely quite close to optimal), as well as a resolution of the optimal competitive ratio for maximum matching in the popular edge arrival model, a central problem in the field.

4. state of the art results on kernel density estimation: we give a single data structure that simultaneously improves upon (or matches) all prior work on kernel density estimation. In a more recent work we introduced a new approach to kernel density estimation that combines discrepancy theory with hashing based techniques, avoiding quadratic dependence on inverse relative precision inherent to all sampling based approaches.

5. nearly optimal results on constructing spectral sparsifiers of hypergraphs, showing that no dependence on the degree of non-linearity (i.e. on maximum hyperedge size) is needed.
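
To make the random-walk idea from highlight 2 concrete, here is a minimal sketch in Python. It is an illustration under simplifying assumptions, not the project's algorithm: the toy graph (two cliques joined by a single edge), the helper names, and the walk parameters are all hypothetical, and the sketch estimates full endpoint distributions rather than using the query-efficient statistics a sublinear algorithm would need. It only shows the underlying intuition that short random walks started from vertices in the same cluster end up distributed similarly, while walks started in different clusters do not.

```python
import random
from collections import Counter

def two_cliques_with_bridge(size):
    """Adjacency lists for two cliques of `size` vertices joined by one edge (toy input)."""
    adj = {v: [] for v in range(2 * size)}
    for base in (0, size):
        for u in range(base, base + size):
            for v in range(base, base + size):
                if u != v:
                    adj[u].append(v)
    # Single bridge edge between the two cliques.
    adj[size - 1].append(size)
    adj[size].append(size - 1)
    return adj

def walk_endpoint(adj, start, length, rng):
    """Run one simple random walk of `length` steps and return where it ends."""
    v = start
    for _ in range(length):
        v = rng.choice(adj[v])
    return v

def endpoint_distribution(adj, start, length, num_walks, rng):
    """Empirical distribution of endpoints over `num_walks` independent walks."""
    counts = Counter(walk_endpoint(adj, start, length, rng) for _ in range(num_walks))
    return {v: c / num_walks for v, c in counts.items()}

def squared_l2_distance(p, q):
    """Squared Euclidean distance between two sparse distributions."""
    return sum((p.get(v, 0.0) - q.get(v, 0.0)) ** 2 for v in set(p) | set(q))

rng = random.Random(0)  # fixed seed so the run is reproducible
adj = two_cliques_with_bridge(10)
p0 = endpoint_distribution(adj, 0, 4, 2000, rng)    # vertex 0: first clique
p1 = endpoint_distribution(adj, 1, 4, 2000, rng)    # vertex 1: first clique
p15 = endpoint_distribution(adj, 15, 4, 2000, rng)  # vertex 15: second clique

same_cluster = squared_l2_distance(p0, p1)
different_cluster = squared_l2_distance(p0, p15)
print(f"same cluster: {same_cluster:.4f}, different clusters: {different_cluster:.4f}")
```

On this toy input the distance for the same-cluster pair should come out much smaller than for the cross-cluster pair, which is the kind of signal that clusterability tests built on random walks rely on.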

We have also been able to obtain strong advances in lower bounds for streaming graph computation. First, our recent result gave an optimal space lower bound for approximating the MAX-CUT problem in the single pass streaming model of computation. This result is based on a number of new techniques that we developed for applying Fourier analytic methods to streaming lower bounds. One of the central ideas in our MAX-CUT lower bound is to use the convolution theorem in Fourier analysis to lower bound the communication complexity of multiparty problems. This idea has also led us to optimal sketching lower bounds for the subgraph counting problem, as well as, more recently, tight bounds for a graph component counting problem in random order streams.

Finally, we have recently been able to introduce new techniques for computing the Fourier transform of structured signals, designing algorithms that can exploit Fourier structure well beyond the standard sparsity assumption. Highlights of this work include a dimension-independent Fourier transform, which alleviates the curse of dimensionality inherent in all previous algorithms for this problem, a close to sample optimal universal sampling method for functions with 'simple' Fourier transforms (this work significantly extends the classical results of Landau, Pollak and Slepian on optimal reconstruction of bandlimited functions) and the first algorithms for numerical linear algebra for kernel matrices that avoid the curse of dimensionality inherent in all previous approaches.
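
For readers unfamiliar with the standard sparsity assumption mentioned above, the following Python sketch shows one classical building block of sparse Fourier transform algorithms: subsampling a signal in time aliases its spectrum, so each frequency f lands in bucket f mod B when only B evenly spaced samples are read. This illustrates the baseline sparse setting only, not the project's techniques for going beyond sparsity; the function names and the parameters (signal length 64, frequency 19, 8 buckets) are assumptions chosen for the example.

```python
import cmath

def dft(samples):
    """Naive discrete Fourier transform (quadratic time; fine for a tiny example)."""
    n = len(samples)
    return [
        sum(samples[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

def bucket_magnitudes(signal, num_buckets):
    """Read only `num_buckets` evenly spaced samples and return |DFT| of them.

    Assumes `num_buckets` divides len(signal). Frequency f of the full signal
    contributes to bucket f mod num_buckets.
    """
    stride = len(signal) // num_buckets
    subsampled = signal[::stride]
    return [abs(c) for c in dft(subsampled)]

n, freq, num_buckets = 64, 19, 8
# A signal whose spectrum has a single nonzero coefficient, at frequency `freq`.
signal = [cmath.exp(2j * cmath.pi * freq * t / n) for t in range(n)]

magnitudes = bucket_magnitudes(signal, num_buckets)
heaviest_bucket = max(range(num_buckets), key=lambda b: magnitudes[b])
print(heaviest_bucket, freq % num_buckets)
```

Only 8 of the 64 samples are read, yet the heaviest bucket reveals the frequency modulo 8. Sparse Fourier transform algorithms combine several such measurements to locate the few significant frequencies without reading the whole signal.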

Our work has led to many exciting directions that were not anticipated at the beginning of the project. These include algorithms for graph processing in random order streams, expander decompositions in insertion/deletion streams, and more.
