
Sublinear Algorithms for Modern Data Analysis

Periodic Reporting for period 3 - SUBLINEAR (Sublinear Algorithms for Modern Data Analysis)

Reporting period: 2021-03-01 to 2022-08-31

The design of efficient algorithms and the mapping of the `boundary of tractability' have been major goals of computer science since the dawn of the digital era, and polynomial runtime has been the classical notion of efficiency since then. Over the past few decades, our computational, measurement and storage capabilities have grown exponentially. However, the sizes of datasets to be processed have grown even faster, resulting in the `big data' phenomenon, which has led to a shift of the `boundary of tractability'. Indeed, on modern inputs quadratic and sometimes even linear time algorithms often become prohibitively expensive. This calls for a new class of techniques with sublinear resource requirements. Specifically, processing large datasets requires algorithms that can compute answers using sublinear runtime, operate under tight restrictions on space (streaming algorithms) and communication (sketching algorithms), or even minimize the number of accesses to the input (sample complexity). The goal of this project is to design such techniques for fundamental data processing problems, thereby building a solid theoretical foundation for modern data analysis.

This project is focused on three main directions: sublinear time graph algorithms and graph sketching, understanding the limits of robust graph exploration in small space, and sparse Fourier transform beyond sparsity. The first of these directions amounts to designing space optimal algorithms for processing very large networks (e.g. community detection, clustering, finding matchings). The second direction asks for impossibility results showing that the algorithms developed in the first direction are optimal. Such impossibility results are an integral part of our goal of `mapping the boundary of tractability', as they show us when the algorithmic results that we have are best possible in our computational models. The last direction asks for very fast algorithms for one of the central tools of data analysis, namely the Fourier transform. Specifically, our goal in this direction is to design techniques for computing the Fourier transform that exploit structural properties of inputs that often occur in practice to obtain fast algorithms.
We have been able to achieve exciting progress on all of the directions set forth in our proposal, with the corresponding papers being disseminated at the top venues for research in theoretical computer science (STOC/FOCS/SODA).

On the algorithmic side of massive graph analysis, our recent work includes several definitive results on central problems in the area. Some of the highlights include:

1. an essentially optimally fast (linear time in output size) algorithm for computing sparsifiers of graphs that can operate even when the input graph undergoes insertions and deletions, a common model for modern massive dynamic graphs.
The main technique underlying this result is a new way of partitioning graphs based on the effective resistance metric that can be viewed as a `hashing scheme' for graph data that allows recovery of the `dominant' edges of the graph. The scheme is based on a combination of dimensionality reduction techniques and locality sensitive hashing, a fundamental (and practical!) method for nearest neighbor search.

2. query optimal algorithms for approximating graph cluster structure in sublinear time using a few random walks. Our first result on this problem introduced new linear algebraic methods for testing clusterability using an asymptotically optimal number of queries. Our more recent work proposes a `clustering oracle', i.e. a query and runtime efficient method for accessing cluster structure of graphs in sublinear time. The main innovation in this work is the idea of obtaining access to the spectral embedding of the graph using short random walks.

3. an algorithm for obtaining constant factor approximation to maximum matching size in polylogarithmic space from a sequence of random samples of edges of the graph (obtaining an approximation ratio close to 1 likely requires significantly more space, i.e. this result is likely quite close to optimal), as well as a resolution of the optimal competitive ratio for maximum matching in the popular edge arrival model, a central problem in the field.
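
To give a feel for why a few short random walks can reveal cluster structure (item 2 above), here is a toy Python sketch. It is only an illustration under our own assumptions, not the algorithm from the project's papers: the graph (two cliques joined by one edge), the walk length, the number of walks and all function names are hypothetical choices. The sketch estimates where short walks started at a vertex end up, and compares two vertices by the inner product of their empirical endpoint distributions.

```python
import random
from collections import Counter


def two_cliques(k):
    """Toy clusterable graph: two k-cliques joined by a single bridge edge."""
    adj = {v: [] for v in range(2 * k)}
    for base in (0, k):
        for u in range(base, base + k):
            for v in range(base, base + k):
                if u != v:
                    adj[u].append(v)
    # Bridge between the last vertex of the first clique and the first of the second.
    adj[k - 1].append(k)
    adj[k].append(k - 1)
    return adj


def endpoint_distribution(adj, start, length, walks, rng):
    """Empirical distribution of the endpoint of `walks` random walks of `length` steps."""
    counts = Counter()
    for _ in range(walks):
        v = start
        for _ in range(length):
            v = rng.choice(adj[v])  # move to a uniformly random neighbour
        counts[v] += 1
    return {v: c / walks for v, c in counts.items()}


def similarity(p, q):
    """Inner product of two endpoint distributions (large when walks mix in the same cluster)."""
    return sum(p[v] * q.get(v, 0.0) for v in p)


rng = random.Random(0)  # fixed seed so the run is repeatable
adj = two_cliques(8)    # vertices 0..7 form one cluster, 8..15 the other
dist = {v: endpoint_distribution(adj, v, length=4, walks=2000, rng=rng) for v in (0, 1, 15)}

same = similarity(dist[0], dist[1])    # two vertices of the same clique
cross = similarity(dist[0], dist[15])  # vertices of different cliques
print(f"same-cluster similarity:  {same:.3f}")
print(f"cross-cluster similarity: {cross:.3f}")
```

In this toy setting the same-cluster similarity comes out much larger than the cross-cluster one, because a short walk rarely crosses the single bridge edge. The number of walks here is chosen for a stable demonstration, not to reflect the query bounds mentioned above.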

We have also been able to obtain strong advances in lower bounds for streaming graph computation. First, our recent result gave an optimal space lower bound for the complexity of approximating the MAX-CUT problem in the single pass streaming model of computation. This result is based on a number of new techniques that we develop for applying Fourier analytic methods to streaming lower bounds. One of the central ideas in our MAX-CUT lower bound is to use the convolution theorem in Fourier analysis to lower bound the communication complexity of multiparty problems. This idea has also led us to optimal sketching lower bounds for the subgraph counting problem.
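
For reference, the convolution theorem mentioned above can be stated as follows. The report does not say over which group it is applied; the Boolean cube and the normalization below are our assumptions, chosen because this is a common setting for Fourier analytic lower bounds.

```latex
% Assumed setting: real-valued functions on the Boolean cube \{0,1\}^n.
\[
  \widehat{f}(S) = \mathbb{E}_{x \in \{0,1\}^n}\!\left[ f(x)\,(-1)^{S \cdot x} \right],
  \qquad
  (f * g)(x) = \mathbb{E}_{y \in \{0,1\}^n}\!\left[ f(y)\, g(x \oplus y) \right],
\]
\[
  \widehat{f * g}(S) = \widehat{f}(S)\,\widehat{g}(S)
  \quad \text{for every } S \in \{0,1\}^n .
\]
```

In words, convolution in the original domain becomes pointwise multiplication of Fourier coefficients, which is what makes it possible to track how the contributions of several parties combine.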

Finally, we have recently been able to introduce new techniques for computing the Fourier transform of structured signals, designing algorithms that can exploit Fourier structure well beyond the standard sparsity assumption. Highlights of this work include a dimension-independent Fourier transform, which alleviates the curse of dimensionality inherent in all previous algorithms for this problem, a close to sample optimal universal sampling method for functions with `simple' Fourier transforms (this work significantly extends the classical results of Landau, Pollak and Slepian on optimal reconstruction of bandlimited functions) and the first algorithms for numerical linear algebra for kernel matrices that avoid the curse of dimensionality inherent in all previous approaches.
Our results so far have led to several exciting questions both in algorithms and lower bounds for processing massive graphs, including the problem of understanding optimal pass vs space tradeoffs for fundamental graph problems, designing optimal methods for extracting information about graphs from short random walks in unstructured settings (i.e. without assumptions on the existence of pronounced cluster structure) and several others. We expect to make strong advances on these fundamental directions before the end of the project. At the same time, some of our recent results (e.g. our new algorithms for the Sparse Fourier Transform problem) appear to hold promise for good empirical performance, a direction that we also expect to pursue in the coming months.