## Periodic Reporting for period 4 - SSBD (Small Summaries for Big Data)

Reporting period: 2019-11-01 to 2021-04-30

In both the business and scientific communities it is agreed that “big data” holds both tremendous promise and substantial challenges. There is much potential for extracting useful intelligence and actionable information from the large quantities of data generated and captured by modern information processing systems.

The challenges are expressed in terms of “four V’s”: volume, variety, velocity, and veracity. That is, big data challenges involve not only the sheer volume of the data, but the fact that it can represent a complex variety of entities and interactions between them, and new observations that arrive, often across multiple locations, at high velocity, and with uncertain provenance. Some examples of applications that generate big data include physical data, medical data, human activity data and business data. Across all of these disparate settings, certain common themes emerge. The data in question is large, and growing. The applications seek to extract patterns, trends or descriptions of the data. Ensuring the scalability of systems, and the timeliness and veracity of the analysis is vital in many of these applications. In order to realize the promise of these sources of data, we need new methods that can handle them effectively.

While such sources of big data are becoming increasingly common, the resources to process them (chiefly, processor speed, fast memory and slower disk) are growing at a slower pace. The consequence of this trend is that there is an urgent need for more effort directed towards capturing and processing data in many critical applications. Careful planning and scalable architectures are needed to fulfill the requirements of analysis and information extraction on big data. In response to these needs, new computational paradigms are being adopted to deal with the challenge of big data. Large scale distributed computation is a central piece: the scope of the computation can exceed what is feasible on a single machine, and so clusters of machines work together in parallel. On top of these architectures, parallel algorithms are designed which can take the complex task and break it into independent pieces suitable for distribution over multiple machines.

A central challenge within any such system is how to compute and represent complex features of big data in a way that can be processed by many machines in parallel. A vital component is the ability to build and manipulate a compact summary of a large amount of data. This powerful notion of a small summary, in all its many and varied forms, is the subject of study of this proposal. The idea of a summary is a natural and familiar one. It should represent something large and complex in a compact fashion. Inevitably, a summary must dispense with some of the detail and nuance of the object it summarizes. However, it should also preserve some key features of the object in a very accurate fashion.
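To make the notion concrete, a classic example of a small summary (not specific to this project, and used here purely as an illustration) is the Misra–Gries frequent-items summary: it keeps at most k-1 counters, yet guarantees that any item occurring more than n/k times in a stream of n items is retained. A minimal Python sketch:

```python
def misra_gries(stream, k):
    """Maintain at most k-1 counters over a stream of items.

    Guarantee: any item occurring more than n/k times in a stream
    of n items survives to the end with a nonzero counter.
    """
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement every counter; drop those that reach zero.
            # This "charges" the new item against k-1 existing ones.
            for y in list(counters):
                counters[y] -= 1
                if counters[y] == 0:
                    del counters[y]
    return counters
```

Summaries of this kind are also mergeable: two summaries built on separate partitions of the data can be combined into a summary of the whole, which is precisely what makes them suitable for the distributed architectures described above.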

The project aims were to promulgate new theory and practice in the area of Small Summaries for Big Data (SSBD). The vision is to substantially advance the state of the art in data summarization, to the point where accurate and effective summaries are available for a wide range of problems, and can be used seamlessly in applications that process big data. While the mathematics needed for this work can be complex, the power of the results is great. The resulting algorithms can be simple enough to be written in a few lines of code, and have applications that cover many different areas of business and society, affecting millions of people. The aim is to develop efficient, scalable methods that can be used by any organization, not just big businesses, to analyze its “big data”.

The plan of work for this project is broken down into four themes: Velocity (new summaries for fundamental computations); Variety (new summaries for complex structures); Veracity (summaries for outsourced computations); and Volume (distributed summaries). We identify a few highlights of achievements under each theme:

• Velocity: We developed new techniques for estimating the size of the maximum matching in a graph. We present improved streaming algorithms for approximating the size of the maximum matching in sparse (bounded-arboricity) graphs. In particular, we present a one-pass algorithm that uses O(log n) space and approximates the size of the maximum matching in graphs of arboricity c to within a factor of O(c). This improves significantly upon the state-of-the-art O(n^2/3)-space streaming algorithms, and is the first poly-logarithmic-space algorithm for this problem.

• Variety: We developed techniques for approximating key quantities of interest in massive graphs. First, we present improved results on the problem of counting triangles in edge-streamed graphs. Subsequently, we studied the problem of estimating the size of independent sets in a graph G defined by a stream of edges.

• Veracity: We developed techniques that allow the efficient checking of outsourced data computations. As the popularity of outsourced computation increases, questions of accuracy and trust between the client and cloud computing services become ever more relevant. Our work aims to provide fast and practical methods to verify analysis of large data sets, where the client’s computation and memory costs are kept to a minimum. Our verification protocols are based on defining “proofs” which are easy to create and check. These add only a small overhead to reporting the result of the computation itself. We build up from protocols for elementary statistical methods to more complex protocols for Ordinary Least Squares, Principal Component Analysis and Linear Discriminant Analysis.

• Volume: We considered the construction and maintenance of machine learning models over data that is large, multi-dimensional, and evolving. In this setting, in addition to computational scalability and model accuracy, we also need to minimize the amount of communication between distributed processors, which is the chief component of latency. We study Bayesian networks, the workhorse of graphical models, and present a communication-efficient method for continuously learning and maintaining a Bayesian network model over data that arrives as a distributed stream partitioned across multiple processors.
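For the Velocity theme, the project's O(log n)-space estimator is beyond a short sketch, but the classical one-pass greedy baseline it improves upon is easy to state: maintain a maximal matching greedily over the edge stream; its size is within a factor of 2 of the maximum matching. A minimal sketch (note this stores the matched vertices explicitly, so unlike the project's algorithm it is not space-efficient):

```python
def greedy_matching_size(edge_stream):
    """One-pass greedy maximal matching over a stream of edges.

    The size of any maximal matching is a 2-approximation of the
    maximum matching size (a classical baseline, not this project's
    sublinear-space algorithm).
    """
    matched = set()  # vertices already covered by the matching
    size = 0
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            matched.add(u)
            matched.add(v)
            size += 1
    return size
```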
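For the Variety theme, a standard way to estimate triangle counts from an edge stream (not the project's own algorithm; this is the well-known sparsification idea of keeping each edge independently with probability p and rescaling the sampled count by 1/p^3) can be sketched as:

```python
import random

def sparsified_triangles(edges, p, seed=0):
    """Estimate the triangle count of a simple graph given as an
    edge list, by sampling each edge with probability p and
    rescaling. With p = 1 the count is exact."""
    random.seed(seed)
    sampled = [e for e in edges if random.random() < p]
    adj = {}
    for u, v in sampled:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    t = 0
    for u, v in sampled:
        # Common neighbours of u and v close a triangle with edge (u, v).
        t += len(adj[u] & adj[v])
    # Each triangle is counted once per edge (3 times); unbias by p^3.
    return t / (3 * p ** 3)
```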
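For the Veracity theme, the project's protocols target statistical methods such as OLS and PCA; the flavour of lightweight verification can be illustrated by the classic Freivalds check (not a result of this project), which verifies a claimed outsourced matrix product C = AB with O(n^2) work per trial instead of the O(n^3) cost of recomputing it:

```python
import random

def freivalds_check(A, B, C, trials=20):
    """Probabilistically verify that C == A @ B for n x n matrices.

    Each trial multiplies by a random 0/1 vector r and compares
    A(Br) with Cr; a wrong C is caught with probability >= 1/2 per
    trial, so it escapes all trials with probability <= 2^-trials.
    """
    n = len(A)
    for _ in range(trials):
        r = [random.randint(0, 1) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False  # caught a discrepancy: C is certainly wrong
    return True  # consistent with C == A @ B, with high probability
```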
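For the Volume theme, one reason Bayesian network learning suits distributed summarization is that the sufficient statistics for each conditional probability table (CPT) are simple counts, which merge exactly across sites. A minimal sketch of that idea (the helper names `local_counts`, `merge_counts` and `cpt` are illustrative, not the project's API, and this omits the communication-efficiency machinery of the actual method):

```python
from collections import Counter

def local_counts(records, parents, child):
    """Sufficient statistics for one CPT at one site:
    counts of (parent-value tuple, child value) pairs."""
    c = Counter()
    for rec in records:
        key = (tuple(rec[p] for p in parents), rec[child])
        c[key] += 1
    return c

def merge_counts(*counts):
    """Counts are mergeable summaries: just add them up."""
    total = Counter()
    for c in counts:
        total.update(c)
    return total

def cpt(counts):
    """Maximum-likelihood CPT from merged counts:
    P(child | parents) = count / total count for those parents."""
    totals = Counter()
    for (pa, _), n in counts.items():
        totals[pa] += n
    return {k: n / totals[k[0]] for k, n in counts.items()}
```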

The results of the project have been widely disseminated, in conferences and journals, through workshops and seminars, and in the form of a published book.

The research plan had four interdependent components, each focusing on how summaries can address one of the four challenges: velocity, variety, volume and veracity. There is no single summary that accurately captures all properties of a data set, even approximately. Thus, at the heart of the study of small summaries are two questions: what should be preserved, and how accurately can it be preserved? The answer to the first question determines which of many possible summary types may be appropriate, or indeed whether any compact summary exists at all. The answer to the second determines the size and processing cost of working with the summary in question. Hence, there were, and remain, many problems to solve in this burgeoning area. Substantial progress was made on these grand-challenge research questions over the duration of the project.