## Periodic Reporting for period 3 - SSBD (Small Summaries for Big Data)

Reporting period: 2018-05-01 to 2019-10-31

In both business and scientific communities it is agreed that “big data” holds both tremendous promise and substantial challenges. There is much potential for extracting useful intelligence and actionable information from the large quantities of data generated and captured by modern information processing systems.

The challenges are expressed in terms of “four V’s”: volume, variety, velocity, and veracity. That is, big data challenges involve not only the sheer volume of the data, but the fact that it can represent a complex variety of entities and interactions between them, and new observations that arrive, often across multiple locations, at high velocity, and with uncertain provenance. Some examples of applications that generate big data include:

* Physical Data. The growing development of sensors and sensor deployments has led to settings where measurements of the physical world are available at very high dimensionality and at a great rate. Scientific measurements are the cutting edge of this trend. Astronomy data gathered from modern telescopes can easily generate terabytes of data in a single night. Aggregating large quantities of astronomical data provides a substantial big data challenge to support the study and discovery of new phenomena. Big data from particle physics experiments is also enormous: each experiment can generate many terabytes of readings, which can dwarf what is economically feasible to store for later comparison and investigation.

* Medical Data. It is increasingly feasible to sequence entire genomes. A single genome is not so large—it can be represented in under a gigabyte—but considering the entire raw genetic data of a large population represents a big data challenge: data sets of the order of 200TB in one recent example. This deluge of data may be accompanied by increasing growth in other forms of medical data, based on monitoring multiple vital signs for many patients at fine granularity. Collectively, this leads to the area of data-driven medicine, seeking better understanding of disease, and leading to new treatments and interventions, personalized for each patient.

* Activity Data. Human activity data is being captured and stored in ever greater quantities and levels of detail. Online social networks record not just friendship relations but interactions, messages, photos and interests. Location data is also increasingly available, due to mobile devices that can obtain GPS information, and to ever more instrumented smart cities. Other electronic activities, such as patterns of website visits, email messages and phone calls, are routinely collected and analyzed. Collectively, this provides ever-larger collections of activity information. Service providers who can collect this data seek to make sense of it in order to identify patterns of behavior or signals of behavioral change, and opportunities for advertising and marketing.

* Business Data. Businesses are increasingly able to capture more, and more complex, data about their customers. For example, online stores can track millions of customers as they explore their site, and seek patterns in purchasing and interest, with the aim of providing better service and anticipating future needs. The level of detail in the data is getting finer and finer. Previously, data might be limited to just the items purchased, but it now extends to more detailed shopping and comparison activity, tracking the whole path to purchase.

Across all of these disparate settings, certain common themes emerge. The data in question is large, and growing. The applications seek to extract patterns, trends or descriptions of the data. Ensuring the scalability of systems, and the timeliness and veracity of the analysis is vital in many of these applications. In order to realize the promise of these sources of data, we need new methods that can handle them effectively.

While such sources of big data are becoming increasingly common, the resources to process them (chiefly, processor speed and memory capacity) are not growing at a comparable rate. This gap motivates the development of small summaries: compact structures that capture the key properties of the data at a fraction of the cost of storing it in full.

The project SSBD is concerned with developing novel algorithms and methodologies for the management and understanding of large volumes of data. The plan of work is broken down into four themes: Velocity (new summaries for fundamental computations); Variety (new summaries for complex structures); Veracity (summaries for outsourced computations); and Volume (distributed summaries). Good progress has been made on all four themes in this period of the project, with outputs in the form of papers appearing in the scientific literature. We identify the highlights of the achievements under each theme:

• Velocity: We have developed new techniques for estimating the size of the maximum matching in a graph. This is a canonical problem in graph analysis, and one that has attracted extensive study across a range of computational models. We present improved streaming algorithms for approximating the size of the maximum matching in sparse (bounded-arboricity) graphs. In particular, we present a one-pass algorithm that uses O(log n) space and approximates the size of the maximum matching in graphs of arboricity c to within a factor of O(c). This improves significantly upon the previous state-of-the-art O(n^2/3)-space streaming algorithms, and is the first poly-logarithmic-space algorithm for this problem. In contrast to prior work, our results take greater advantage of the streaming access to the input, characterizing the matching size in terms of the ordering of the edges in the stream, in addition to the degree distributions and structural properties of the sparse graphs.
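The one-pass O(c)-approximation itself is too involved for a short example, but the classic streaming baseline it improves upon is easy to sketch: a greedily maintained maximal matching is always within a factor 2 of the maximum, at the cost of O(n) rather than O(log n) space. The following Python sketch (an illustration, not the algorithm from this work) shows that baseline:

```python
def greedy_matching_size(edge_stream):
    """One-pass greedy maximal matching over an edge stream.

    A maximal matching is at least half the size of a maximum
    matching, so this 2-approximates the matching size, using
    O(n) space to remember which vertices are matched.
    """
    matched = set()
    size = 0
    for u, v in edge_stream:
        if u not in matched and v not in matched:
            matched.update((u, v))
            size += 1
    return size

# Path 1-2-3-4: greedy matches (1, 2), skips (2, 3), matches (3, 4).
print(greedy_matching_size([(1, 2), (2, 3), (3, 4)]))  # prints 2
```

Note that the answer can depend on the stream order; the factor-2 guarantee holds for every order, which is what the more refined order-sensitive analysis above improves upon for sparse graphs.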

In ongoing work, we have developed algorithms for identifying highly correlated pairs in a collection of n mostly pairwise-uncorrelated random variables, where observations of the variables arrive as a stream. We quickly generate sketches of random combinations of signals, and use these in concert with ideas from coding theory to decode the identities of correlated pairs. We prove correctness, and compare performance and effectiveness against the best LSH (locality-sensitive hashing) based approach. This work has been accepted for publication at the International Conference on Database Theory 2018, and a technical report is available in the arXiv repository.
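The coding-theoretic decoding step is beyond a short example, but the underlying sketching idea can be illustrated with classic AMS-style sign sketches, whose row-wise products give unbiased estimates of inner products (and hence of correlations, for standardized signals). This is a hedged sketch of the general technique, not the construction from this work:

```python
import random

def make_signs(rows, num_obs, seed=0):
    # Each row is an independent vector of random +/-1 signs.
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(num_obs)]
            for _ in range(rows)]

def sketch(signal, signs):
    # One counter per row: a sign-weighted sum of the observations.
    return [sum(s * x for s, x in zip(row, signal)) for row in signs]

def estimate_dot(sk_a, sk_b):
    # Each row product is an unbiased estimate of the inner
    # product <a, b>; averaging over rows reduces the variance.
    return sum(a * b for a, b in zip(sk_a, sk_b)) / len(sk_a)

signs = make_signs(rows=2000, num_obs=10, seed=1)
a = [1.0] * 10          # two perfectly correlated signals
b = [1.0] * 10
est = estimate_dot(sketch(a, signs), sketch(b, signs))
# est is close to the true inner product <a, b> = 10
```

Because sketching is linear, sketches of random combinations of many signals can be formed directly, which is the starting point for the decoding approach described above.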

• Variety: We have developed techniques for approximating important quantities of interest in massive graphs. First, we present improved results on the problem of counting triangles in edge-streamed graphs. For graphs with m edges and at least T triangles, we show that an extra look over the stream yields a two-pass streaming algorithm that uses small space and outputs a (1+ε)-approximation of the number of triangles in the graph. In terms of the dependence on T, we show that additional passes would not lead to a better space bound: we prove that no constant-pass streaming algorithm can distinguish triangle-free graphs from graphs with at least T triangles within the corresponding space bound.
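For intuition about the quantity being approximated, the triangle count can be computed exactly with two passes if one is willing to store the whole graph, which is precisely the cost the streaming algorithm avoids. A toy illustration (not the small-space algorithm from this work):

```python
from collections import defaultdict

def count_triangles_two_pass(edge_stream):
    # Pass 1: build adjacency sets (this stores the whole graph;
    # the point of the streaming algorithm is to avoid this cost).
    adj = defaultdict(set)
    for u, v in edge_stream:
        adj[u].add(v)
        adj[v].add(u)
    # Pass 2: each triangle is seen once per edge, so sum the
    # common neighbours of every edge and divide by 3.
    total = sum(len(adj[u] & adj[v]) for u, v in edge_stream)
    return total // 3

edges = [(1, 2), (2, 3), (1, 3), (3, 4)]
print(count_triangles_two_pass(edges))  # prints 1: triangle {1, 2, 3}
```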

Subsequently, we have studied the problem of estimating the size of independent sets in a graph G defined by a stream of edges. Our approach relies on the Caro-Wei bound, which lower-bounds the desired quantity by a sum over nodes of the reciprocal of their degrees, denoted by β(G). Our results show that β(G) can be approximated accurately, given a lower bound on β. Stronger results are possible when the edges are promised to arrive grouped by an incident node. In this setting, we obtain an estimate that is at most a logarithmic factor below the true value of β and no more than the true independent set size. To justify the form of this bound, we also show an Ω(n/β) lower bound on the space of any algorithm that approximates β up to a constant factor. This work is under submission, and a technical report is available in the arXiv repository.
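The Caro-Wei bound itself is simple to compute offline, which makes the streaming challenge, approximating it without storing per-node degrees, concrete. A direct (non-streaming) computation, for illustration:

```python
from collections import Counter

def caro_wei_bound(num_nodes, edges):
    """Caro-Wei bound beta(G) = sum over nodes of 1/(deg(v) + 1),
    a lower bound on the maximum independent set size."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(1.0 / (deg[v] + 1) for v in range(num_nodes))

# Star on 5 nodes: centre has degree 4, leaves degree 1, so
# beta = 1/5 + 4 * (1/2) = 2.2, while the four leaves form an
# independent set of size 4 -- consistent with a lower bound.
beta = caro_wei_bound(5, [(0, 1), (0, 2), (0, 3), (0, 4)])
```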

• Veracity: We have developed techniques that allow the efficient checking of outsourced data computations.

The emphasis of the project is on the design and application of novel algorithms, and hence all results are concerned with novel methodologies. We give some examples:

• In collaboration with other researchers, we developed new ideas relating to optimization under stability. In computing, as in many aspects of life, changes incur cost. Many optimization problems are formulated as one-time instances solved from scratch. However, a common case is that we already have a set of prior assignments, and must decide how to respond to a new set of constraints, given that each change from the current assignment comes at a price. That is, we would like to maximize the fitness or efficiency of our system, but we need to balance this against the changeout cost from the previous state. We provide a precise formulation of this tradeoff and analyze the resulting stable extensions of some fundamental problems in measurement and analytics. Our main technical contribution is a stable extension of Probability Proportional to Size (PPS) weighted random sampling, with applications to monitoring and anomaly-detection problems. We also provide a general framework that applies to top-k, minimum spanning tree, and assignment problems. In each case, we are able to provide exact solutions and discuss efficient incremental algorithms that can find new solutions as the input changes.
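The stable extension is the contribution of this work, but the underlying primitive, probability-proportional-to-size sampling with Horvitz-Thompson adjustment, can be sketched as follows. This is an illustrative Poisson-PPS variant with a fixed threshold tau, not the stable scheme itself:

```python
import random

def pps_sample(weights, tau, rng):
    """Poisson PPS sampling: keep item i with prob min(1, w_i / tau).

    The Horvitz-Thompson adjusted weight w_i / p_i makes estimates
    of subset-sum totals from the sample unbiased.
    """
    sample = []
    for i, w in enumerate(weights):
        p = min(1.0, w / tau)
        if rng.random() < p:
            sample.append((i, w / p))
    return sample

weights = [5.0, 1.0, 0.5, 8.0, 0.2]
# Items with weight >= tau are always kept with their exact weight;
# light items are kept with probability proportional to their weight.
sample = pps_sample(weights, tau=2.0, rng=random.Random(0))
```

Averaging the adjusted weights of repeated samples recovers the true total weight, which is the unbiasedness property the stable extension must preserve while limiting changes between consecutive samples.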

• In a separate collaboration, we developed new methodologies linking the areas of kernelization and streaming. We developed a simple but powerful subgraph sampling primitive that is applicable in a variety of computational models, including dynamic graph streams (where the input graph is defined by a sequence of edge/hyperedge insertions and deletions) and distributed systems such as MapReduce. In the case of dynamic graph streams, we use this primitive to prove the following results: (1) Matching: Our main result for matchings is that there exists an Õ(k^2)-space algorithm that returns the edges of a maximum matching, under the assumption that its cardinality is at most k. The best previous algorithm used Õ(kn) space, where n is the number of vertices in the graph, and we prove our result is optimal up to logarithmic factors. (2) Vertex Cover and Hitting Set: There exists an Õ(kd)-space algorithm that solves the minimum hitting set problem, where d is the cardinality of the input sets and k is an upper bound on the size of the minimum hitting set; we prove this is optimal up to logarithmic factors. Our algorithm has Õ(1) update time. The case d=2 corresponds to minimum vertex cover. (3) Finally, we considered a larger family of parameterized problems (including b-matching, disjoint paths, and vertex coloring, among others) for which our subgraph sampling primitive yields fast, small-space dynamic graph stream algorithms. We then show lower bounds for natural problems outside this family.
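The flavour of kernelization can be conveyed by the classic Buss rule for vertex cover (the d = 2 hitting set case). It is not the sampling primitive from this work, but it shows the same intuition behind the Õ(k^2) bound: after forced choices are made, a yes-instance has at most k^2 edges left.

```python
from collections import defaultdict

def buss_kernel(edges, k):
    """Classic Buss kernel for vertex cover: a vertex of degree > k
    must be in every cover of size <= k, and once such vertices are
    removed, a yes-instance has at most k^2 remaining edges."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    forced = {v for v in adj if len(adj[v]) > k}
    remaining = [(u, v) for u, v in edges
                 if u not in forced and v not in forced]
    return forced, remaining

# Star with centre 0 and five leaves, k = 2: the centre is forced
# into the cover and no edges remain.
forced, remaining = buss_kernel([(0, i) for i in range(1, 6)], k=2)
```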

• Many topics covered by the project cross disciplinary boundaries, having impact not just within computer science, but also in neighbouring areas such as mathematics and statistics, and on into the social sciences. An example of this is the work on correlation clustering in the dynamic data stream model. Here, the stream consists of updates to the edge weights of a graph on n nodes, and the goal is to find a node-partition such that the end-points of negative-weight edges are typically in different clusters, whereas the end-points of positive-weight edges are typically in the same cluster. We presented polynomial-time, O(n·polylog n)-space approximation algorithms for the natural problems that arise. We first develop data structures based on linear sketches that allow the “quality” of a given node-partition to be measured. We then combine these data structures with convex programming and sampling techniques to solve the relevant approximation problems. Our work presents space-efficient algorithms for the convex programming required, as well as approaches to reduce the adaptivity of the sampling.
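The “quality” measure being sketched can be stated exactly: the number of disagreements of a partition with the signed edges. A direct (non-streaming) computation, for illustration only:

```python
def disagreements(signed_edges, cluster_of):
    """Correlation-clustering cost of a node-partition: each
    positive edge cut between clusters, and each negative edge
    kept inside a cluster, counts as one disagreement."""
    cost = 0
    for u, v, sign in signed_edges:
        same = cluster_of[u] == cluster_of[v]
        if (sign > 0 and not same) or (sign < 0 and same):
            cost += 1
    return cost

edges = [(1, 2, +1), (2, 3, -1), (1, 3, +1)]
# Putting {1, 3} together and 2 alone violates only edge (1, 2).
print(disagreements(edges, {1: 0, 2: 1, 3: 0}))  # prints 1
```

The streaming algorithms above estimate exactly this cost from linear sketches of the edge updates, without storing the graph.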
