CORDIS - EU research results

Elastic and transparent scaling for stream processing applications

Final Report Summary - 321526 (Elastic and transparent scaling for stream processing applications)

In a world where "data is the new oil", Big Data has become an important driver for innovation and growth that relies on disruptive technologies. These technologies aim to cope with the sheer volume of data we need to sift through in order to derive actionable insights. An important part of Big Data is Fast Data: data that flows continuously from online sources, such as software and hardware sensors; in short, data streams. Examples of data streams can be found in several domains, such as live stock ticker data in financial markets, call detail records in telecommunications, video streams in surveillance, production line status feeds in manufacturing, and vital body signals in healthcare. In all of these domains there is a need to gather, process, and analyze data streams, detect emerging patterns and outliers, extract valuable insights, and generate actionable results. Most importantly, this analysis often needs to happen in near real-time.

Stream processing is a computational paradigm that enables carrying out these tasks in an efficient and scalable manner. Streaming applications are programs that process continuous data streams on the fly, as the data flows through the system. They are typically represented as a graph of streams and operators, where operators are generic data manipulators and streams connect operators to each other. To achieve scale, stream processing applications are distributed over a set of machines, often in a cluster environment. The key capability of stream processing systems is their ability to handle a high number, rate, and variety of data sources with low latency. They represent a technological shift from the traditional store-and-forward model of computation towards processing on the go.
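
As a rough illustration of the operator-and-stream model described above, the following minimal Python sketch chains generator-based operators into a linear pipeline. All names here (`filter_op`, `map_op`, `connect`, `ticker`) are hypothetical and chosen for illustration only; they are not taken from C-Stream, Joker, or any other system, and a real application graph may branch and merge rather than form a simple chain.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

# Assumption: an operator is modeled as a function from an input
# stream (any iterable) to an output stream (an iterator).
Operator = Callable[[Iterable], Iterator]


def filter_op(predicate: Callable) -> Operator:
    """Build an operator that forwards only tuples matching the predicate."""
    def run(stream: Iterable) -> Iterator:
        for item in stream:
            if predicate(item):
                yield item
    return run


def map_op(fn: Callable) -> Operator:
    """Build an operator that transforms each tuple with fn."""
    def run(stream: Iterable) -> Iterator:
        for item in stream:
            yield fn(item)
    return run


def connect(source: Iterable, operators: List[Operator]) -> Iterator:
    """Wire operators into a linear chain: each output stream feeds the next."""
    stream = iter(source)
    for op in operators:
        stream = op(stream)
    return stream


def ticker() -> Iterator[Tuple[str, int]]:
    """Hypothetical source: (symbol, price in cents) stock ticks."""
    yield from [("ACME", 1000), ("INIT", 9950), ("ACME", 1250), ("INIT", 10100)]


# Keep only ACME ticks, then extract the price.
pipeline = connect(ticker(), [
    filter_op(lambda tick: tick[0] == "ACME"),
    map_op(lambda tick: tick[1]),
])
results = list(pipeline)
```

Because the operators are generators, each tuple flows through the whole chain as soon as it is produced, which loosely mirrors processing data on the fly rather than storing it first.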

Stream processing systems are highly suitable for building applications that perform online data analytics, yet their value in practice depends on their ability to provide a flexible and effective platform for executing such analytics. The initial flurry of systems research in stream processing has addressed a large number of issues, in areas such as streaming languages, runtime systems, and performance optimizations. However, it has also left serious gaps. In the ETS4SPA project, we have addressed some of these shortcomings. In particular, we have developed systems and techniques to make stream processing a more effective computational paradigm by enabling elastic and transparent parallelization.

Due to the long-running nature of streaming applications and the highly dynamic nature of their workloads, streaming systems must apply performance optimizations adaptively at runtime. Many streaming optimizations have been proposed in the literature, but a holistic approach to combine them in an elastic and transparent manner has been missing. An efficient streaming platform should be able to apply these optimizations online, continuously adjusting the mapping of logical application pieces to the set of available resources.
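
To make the idea of online adaptation more concrete, here is a deliberately simple threshold-based sketch of how a runtime might adjust the number of parallel replicas of an operator as load changes. This is an assumption made purely for illustration: the function name `next_replica_count`, the utilization metric, and the thresholds are all hypothetical, and the sketch is not the adaptation algorithm developed in the project.

```python
def next_replica_count(current: int, utilization: float,
                       low: float = 0.3, high: float = 0.8,
                       max_replicas: int = 8) -> int:
    """Decide the replica count for the next adaptation period.

    utilization is a hypothetical load metric in [0, 1], for example the
    fraction of time the operator's input buffer was full.
    """
    if utilization > high and current < max_replicas:
        return current + 1   # congested: scale up by one replica
    if utilization < low and current > 1:
        return current - 1   # underused: scale down by one replica
    return current           # within the target band: leave unchanged


# Simulated sequence of load measurements, one per adaptation period.
replicas = 1
history = []
for load in [0.9, 0.95, 0.85, 0.5, 0.2, 0.1]:
    replicas = next_replica_count(replicas, load)
    history.append(replicas)
```

In this simulated run the replica count grows while the load stays high and shrinks again once it drops. A practical controller would also need to account for the cost of state migration and avoid oscillating between configurations.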

In the ETS4SPA project, we have developed techniques to enable runtime adaptation for streaming applications. These include automatic pipeline parallelization, automatic data parallelization (fission), and automatic pipelined fission (combined pipeline and data parallelism). We have also developed two prototype stream processing systems, C-Stream and Joker, to showcase the effectiveness of these techniques in practice. Our results show that streaming applications can be scaled automatically and elastically at runtime, in a manner that is completely transparent to application developers.
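
The data parallelization (fission) mentioned above can be sketched as follows: a splitter tags each tuple with a sequence number and distributes tuples round-robin across replicas of a stateless operator, and a merger restores the original order. This is a minimal sketch under stated assumptions, not the implementation used in C-Stream or Joker; the name `run_with_fission` is hypothetical, the operator is assumed to be stateless, and the merger here waits for all results instead of emitting them incrementally as a streaming system would.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of a replica's input


def run_with_fission(items, operator, num_replicas):
    """Apply a stateless operator with num_replicas parallel copies,
    returning outputs in the original input order."""
    inboxes = [queue.Queue() for _ in range(num_replicas)]
    results = {}              # sequence number -> operator output
    lock = threading.Lock()   # guards the shared results dictionary

    def replica(inbox):
        # Each replica runs the same operator logic on its share of tuples.
        while True:
            message = inbox.get()
            if message is _DONE:
                return
            seq, item = message
            output = operator(item)
            with lock:
                results[seq] = output

    workers = [threading.Thread(target=replica, args=(inbox,))
               for inbox in inboxes]
    for worker in workers:
        worker.start()

    # Splitter: tag tuples with sequence numbers, distribute round-robin.
    count = 0
    for seq, item in enumerate(items):
        inboxes[seq % num_replicas].put((seq, item))
        count += 1
    for inbox in inboxes:
        inbox.put(_DONE)
    for worker in workers:
        worker.join()

    # Merger: re-establish the original order using the sequence numbers.
    return [results[seq] for seq in range(count)]


def square(x):
    return x * x


parallel_output = run_with_fission(range(10), square, num_replicas=3)
sequential_output = [square(x) for x in range(10)]
```

The output matches sequential execution regardless of the replica count, which is the sense in which the parallelization is transparent to the developer who wrote `square`. Note that CPython threads do not speed up CPU-bound operators; the sketch shows the structure of fission, not its performance.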

The ETS4SPA project has also supported the establishment of the Bilkent Data-Intensive Distributed Systems Lab (Bil-DIDS) and, as part of this lab, graduate-level studies in the area of streaming systems. The project web site can be found at http://www.cs.bilkent.edu.tr/~bgedik/bildids/doku.php/projects:autoparallel. The research conducted within the Bil-DIDS lab has resulted in several publications in top research journals. Several tutorials have been given at both national and international venues to further disseminate the results of the project.

We expect the results from the ETS4SPA project to contribute greatly to the technological shift needed to address Fast Data challenges. They will also serve as an enabler for widespread adoption of stream processing systems. We expect the resulting technologies to foster productivity growth in Europe, since Big Data and Fast Data are affecting not only software-intensive industries but also a wide spectrum of services.