Final Report Summary - 321526 (Elastic and transparent scaling for stream processing applications)
Stream processing is a computational paradigm that enables carrying out these tasks in an efficient and scalable manner. Streaming applications are programs that process continuous data streams on the fly, as the data flows through the system. They are typically represented as a graph of streams and operators, where operators are generic data manipulators and streams connect operators to each other. To achieve scale, stream processing applications are distributed over a set of machines, often in a cluster environment. The key capability of stream processing systems is their ability to handle a high number, rate, and variety of data sources with low latency. They represent a technological shift from the traditional store-and-forward model of computation towards processing on the go.
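As a rough illustration of the operator graph described above, the following minimal Python sketch chains three operators through streams modeled as generators. The operator names (source, map_op, filter_op) and the example values are hypothetical and chosen only for illustration; they are not taken from the project's systems.

```python
from typing import Callable, Iterable, Iterator


def source(values: Iterable[int]) -> Iterator[int]:
    """Source operator: emits tuples one at a time."""
    for value in values:
        yield value


def map_op(fn: Callable[[int], int], stream: Iterator[int]) -> Iterator[int]:
    """Stateless operator: transforms each tuple as it arrives."""
    for item in stream:
        yield fn(item)


def filter_op(pred: Callable[[int], bool], stream: Iterator[int]) -> Iterator[int]:
    """Stateless operator: forwards only the tuples that satisfy pred."""
    for item in stream:
        if pred(item):
            yield item


def run_pipeline(values: Iterable[int]) -> list:
    """Wire the graph source -> map -> filter and collect the output."""
    readings = source(values)
    scaled = map_op(lambda x: x * 10, readings)
    large = filter_op(lambda x: x > 20, scaled)
    return list(large)
```

Each tuple passes through the whole chain before the next one is read from the source, which mirrors the processing on the go model rather than store-and-forward. A real system would run the operators concurrently and distribute them over several machines.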
Stream processing systems are highly suitable for building applications that perform online data analytics, yet their value in practice depends on their ability to provide a flexible and effective platform for executing such analytics. The initial flurry of systems research in stream processing has addressed a large number of issues, in areas such as streaming languages, runtime systems, and performance optimizations. However, it has also left serious gaps. In the ETS4SPA project, we have addressed some of these shortcomings. In particular, we have developed systems and techniques to make stream processing a more effective computational paradigm by enabling elastic and transparent parallelization.
Due to the long-running nature of streaming applications and the highly dynamic nature of their workloads, streaming systems must apply performance optimizations adaptively at runtime. Many streaming optimizations have been proposed in the literature, but a holistic approach that combines them in an elastic and transparent manner has been missing. An efficient streaming platform should be able to apply these optimizations online, continuously adjusting the mapping of logical application pieces to the set of available resources.
In the ETS4SPA project, we have developed techniques to enable runtime adaptation for streaming applications. These include automatic pipeline parallelization, automatic data parallelization (fission), and automatic pipelined fission (combined pipeline and data parallelism). We have also developed two prototype stream processing systems, C-Stream and Joker, to showcase the effectiveness of these techniques in practice. Our results show that streaming applications can be scaled automatically and elastically at runtime, in a manner that is completely transparent to application developers.
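To make the idea of data parallelization (fission) more concrete, the sketch below replicates a keyed aggregation operator across several channels. It is a simplified illustration under stated assumptions and not the implementation used in C-Stream or Joker: the function names are hypothetical, the input is a finite list rather than a live stream, and tuples are assumed to be (key, value) pairs that are routed by a hash of the key so that each key is always handled by the same replica.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(tuples, num_replicas):
    """Split step: route each (key, value) tuple to a channel by key hash."""
    channels = [[] for _ in range(num_replicas)]
    for key, value in tuples:
        index = zlib.crc32(key.encode("utf-8")) % num_replicas
        channels[index].append((key, value))
    return channels


def sum_per_key(channel):
    """Replicated operator: keeps a running sum per key (partitioned state)."""
    totals = Counter()
    for key, value in channel:
        totals[key] += value
    return dict(totals)


def parallel_sum(tuples, num_replicas):
    """Run one replica per channel, then merge the partial results."""
    channels = partition(tuples, num_replicas)
    with ThreadPoolExecutor(max_workers=num_replicas) as pool:
        partials = list(pool.map(sum_per_key, channels))
    merged = {}
    for partial in partials:
        merged.update(partial)  # key sets are disjoint across replicas
    return merged
```

Because the result does not depend on num_replicas, the degree of parallelism can be changed without touching the operator logic, which is the sense in which the scaling is transparent. An elastic system would additionally have to migrate the per-key state between replicas when the number of channels changes at runtime, which this sketch does not attempt.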
The ETS4SPA project has also supported the establishment of the Bilkent Data-Intensive Distributed Systems Lab (Bil-DIDS) and, as part of this lab, graduate-level studies in the area of streaming systems. The project web site can be found at http://www.cs.bilkent.edu.tr/~bgedik/bildids/doku.php/projects:autoparallel. The research conducted within the Bil-DIDS lab has resulted in several publications in top research journals. Several tutorials have been given at both national and international venues to further disseminate the results of the project.
We expect the results of the ETS4SPA project to contribute greatly to the technological shift needed to address Fast Data challenges. They will also serve as an enabler for the widespread adoption of stream processing systems. We expect the resulting technologies to foster productivity growth in Europe, since Big Data and Fast Data affect not only software-intensive industries but also a wide spectrum of services.