Both the business and scientific communities agree that “big data” holds tremendous promise and poses substantial challenges. There is much potential for extracting useful intelligence and actionable information from the large quantities of data generated and captured by modern information processing systems.
The challenges are often expressed in terms of the “four V’s”: volume, variety, velocity, and veracity. That is, big data challenges involve not only the sheer volume of the data, but also the fact that it can represent a complex variety of entities and the interactions between them, that new observations arrive at high velocity, often across multiple locations, and that their provenance is uncertain. Applications that generate big data span physical data, medical data, human activity data and business data. Across all of these disparate settings, certain common themes emerge. The data in question is large, and growing. The applications seek to extract patterns, trends or descriptions of the data. Ensuring the scalability of systems, and the timeliness and veracity of the analysis, is vital in many of these applications. In order to realize the promise of these sources of data, we need new methods that can handle them effectively.
While such sources of big data are becoming increasingly common, the resources to process them (chiefly processor speed, fast memory and slower disk) are growing at a slower pace. The consequence of this trend is an urgent need for more effort directed towards capturing and processing data in many critical applications. Careful planning and scalable architectures are needed to fulfill the requirements of analysis and information extraction on big data. In response to these needs, new computational paradigms are being adopted to deal with the challenge of big data. Large scale distributed computation is a central piece: the scope of the computation can exceed what is feasible on a single machine, and so clusters of machines work together in parallel. On top of these architectures, parallel algorithms are designed that take a complex task and break it into independent pieces suitable for distribution over multiple machines.
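As a minimal sketch of this idea of breaking a task into independent pieces, the Python fragment below counts how often each record occurs by splitting the input into chunks, counting each chunk separately, and combining the partial results. It is an illustration only: the function names (split, count_chunk, parallel_count) are hypothetical, and a local thread pool merely stands in for the cluster of machines a real deployment would use.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split(records, pieces):
    """Cut the input into at most `pieces` independent chunks."""
    size = max(1, -(-len(records) // pieces))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_chunk(chunk):
    """Work done on one piece; it needs no knowledge of the other pieces."""
    return Counter(chunk)


def parallel_count(records, pieces=4):
    """Count occurrences by processing chunks independently, then combining."""
    chunks = split(records, pieces)
    # Assumption: worker threads stand in for separate machines.
    with ThreadPoolExecutor(max_workers=pieces) as pool:
        partials = list(pool.map(count_chunk, chunks))
    total = Counter()
    for partial in partials:
        total.update(partial)  # combining partial counts is a simple addition
    return total
```

The combined result equals what a single pass over all the records would produce, because each chunk is counted without reference to any other chunk and the partial counts simply add up.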
A central challenge within any such system is how to compute and represent complex features of big data in a way that can be processed by many individual machines in parallel. A vital component is the ability to build and manipulate a compact summary of a large amount of data. This powerful notion of a small summary, in all its many and varied forms, is the subject of this proposal. The idea of a summary is a natural and familiar one. It should represent something large and complex in a compact fashion. Inevitably, a summary must dispense with some of the detail and nuance of the object that it summarizes. However, it should also preserve some key features of the object very accurately.
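To make the notion of a small summary concrete, the sketch below implements one well-known example, the Misra-Gries summary for frequent items. This particular technique is not named in the text above and is chosen here purely as an illustration; the function name misra_gries and the parameter k are assumptions of this sketch. The summary keeps at most k counters regardless of the length of the input, so it discards most of the detail, yet each estimated count is never larger than the true count and falls short of it by at most n/(k+1), where n is the number of items seen.

```python
def misra_gries(stream, k):
    """Summarize a stream of items using at most k counters.

    For every item, the stored count (0 if the item is absent) is at most
    the true count, and undercounts it by at most n / (k + 1), where n is
    the total number of items in the stream.
    """
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k:
            counters[item] = 1
        else:
            # No free counter: decrement every counter and drop those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


stream = ["a"] * 6 + ["b"] * 3 + ["c", "d", "e"]
summary = misra_gries(stream, k=2)
print(summary)  # {'a': 3}
```

In this small run the item "a" occurs 6 times out of n = 12, and its estimate of 3 lies within the permitted shortfall of 12/3 = 4. The rare items are forgotten entirely, which is exactly the detail the summary is allowed to give up.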
The project aimed to promulgate new theory and practice in the area of Small Summaries for Big Data (SSBD). The vision is to substantially advance the state of the art in data summarization, to the point where accurate and effective summaries are available for a wide range of problems, and can be used seamlessly in applications that process big data. While the mathematics needed for this work can be complex, the results are powerful. The resulting algorithms can be simple enough to be written in a few lines of code, and have applications that cover many different areas of business and society, affecting millions of people. The aim is to develop efficient, scalable methods that can be used by any organization, not just big businesses, to analyze its “big data”.