Scalable online machine learning for predictive analytics and real-time interactive visualization

Periodic Reporting for period 2 - PROTEUS (Scalable online machine learning for predictive analytics and real-time interactive visualization)

Reporting period: 2017-06-01 to 2018-11-30

As enormous volumes of data are created in real time, new requirements and interest are placed on extracting knowledge from massive data sets. In industrial settings, Big Data streams from ubiquitous sensors are collected and aggregated for further processing. Different processes are creating a new generation of networked data.
Big Data comes with new technical and processing requirements, so adapted processing methods and tools are needed. Big Data has the potential to become an efficient medium to address societal challenges in areas such as health, environment and life sciences, and a main enabler for the digital economy. Harnessing its potential gives EU industry a competitive edge for creating economic growth and jobs.
In this context, PROTEUS aims to contribute a big data analytics platform that can grow and evolve on top of existing emerging big data platforms. In particular, the PROTEUS mission is to develop scalable online machine learning techniques for predictive analytics and real-time interactive visualization, considering scalability, usability and effectiveness when dealing with extremely large data sets and streams. These techniques are integrated into an enhanced version of Apache Flink.
PROTEUS addresses specific challenges related to the scalability and responsiveness of analytics capabilities:
1) Enabling real-time machine learning for advanced massive stream data analytics. Scalable streaming algorithms cover classification, clustering, regression, basic sketches, and approximate online linear algebra routines, while advanced ones include novelty and change detection, semi-supervised, and active learning. A library is built, and usability is reinforced through declarative languages and APIs.
2) Enabling real-time hybrid computation (batch data and data streams). A processing engine for managing a hybrid-merge mode serves as the basis for the real-time (machine learning & visual) analytics. Optimized architectures are conceived for automated distribution of tasks over clusters for effective stream and historical data mining.
3) Enabling real-time interactive visual analytics in big data. New ways of presenting information make knowledge derived from extremely large and/or streaming data valuable and actionable. An incremental approach makes it possible to achieve low-latency advanced visualizations and interactions. Ultimately, a novel web-based visualization library enables a disruptive approach to visual analysis of (big) data.
4) Enhancing Apache Flink, an EU-originated and emerging Big Data platform. The innovations are integrated and tailored to Apache Flink.
5) Achieving industrial demonstration in an operational environment. User-defined predictive analytics challenges from the steelmaking industrial process drive the project. Feedback under realistic conditions ensures the demonstration of ready-to-use technology. In addition, benchmarks ensure the technology is context-independent and applicable in other domains.
PROTEUS follows an iterative, scenario-guided approach, through incremental stages. Initially, the project detailed the requirements and objectives, as well as the Data Management Plan. This was achieved through a complete catalogue of functional requirements (types of data, type of operations, type of queries, etc.) and non-functional requirements (response time, effectiveness, etc.) derived from the industrial scenario, and a Data Management Plan to manage training data and real data during the evaluation tests. Then the setup of the validation scenario was prepared, including benchmarks and KPIs. This resulted in a detailed description of how to execute the new technology in the industrial system. Finally, by the time of this report, a first prototype of the system had been put into testing. An integrated processing engine for analysing data-at-rest and data-in-motion in a hybrid-merge way within the Apache Flink data platform was deployed as the basis, and a prototype version of the stream analytics library was integrated on top, enabling operations (moments, heavy hitters, event detection, subsampling, semiring statistics) that provide actionable insights for decision making.
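To make the "subsampling" operation mentioned above more concrete, the following is a minimal sketch of reservoir sampling (Algorithm R), a standard technique for keeping a uniform random sample of fixed size from a stream of unknown length. It is an illustration in plain Python and not the PROTEUS implementation; the function name `reservoir_sample` and its parameters are assumptions made for this example.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int,
                     seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration only (Algorithm R).
    Memory use is O(k) regardless of the stream length.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir
```

Each arriving item is handled in constant time, which is what makes this kind of operation suitable for data-in-motion.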
PROTEUS follows an Open Source policy to release core outcomes, using GitHub as the code repository, open to the community (https://github.com/PROTEUS-H2020). The key available results include:
A) PROTEUS Engine: an overhauled version of Apache Flink supporting hybrid computation on batch datasets and data streams.
B) PROTEUS Language: a declarative language library for Scalable Data Analysis.
C) PEACH - Proteus Elastic Cache: the PROTEUS elastic distributed cache.
D) PROTEUS Incremental Analytics: a backend module that implements incremental versions (~O(1) computational cost, using approximations) of the most common analytics operations.
E) SOLMA - Scalable Online Machine Learning and Data Mining Algorithms: a scalable library adapted to the data analytics platform Apache Flink consisting of efficient distributed online algorithms for basic utilities, sketches as well as advanced online predictive analytics.
F) PROTEIC.JS: an HTML5 and CSS3 charts library, ready for batch and streaming data.
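As an illustration of what an incremental operation with roughly constant cost per element (item D above) can look like, here is a minimal sketch of a running mean and variance using Welford's method. It is not taken from the PROTEUS code base; the class name `RunningMoments` and its fields are assumptions for this example. This particular statistic can be maintained exactly; other operations, such as quantiles or distinct counts, generally cannot be maintained exactly in constant memory, which is where approximations come in.

```python
from dataclasses import dataclass


@dataclass
class RunningMoments:
    """Running count, mean and variance (hypothetical illustration).

    Each update costs O(1) time and the state is O(1) in size,
    independent of how many values have been seen.
    """
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the current mean

    def update(self, x: float) -> None:
        # Welford's update: adjust the mean, then accumulate the
        # squared deviation using both the old and the new mean.
        self.count += 1
        delta = x - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self) -> float:
        # Population variance; defined as 0.0 before any data arrives.
        return self.m2 / self.count if self.count else 0.0
```

A dashboard or model consuming the stream can read `mean` and `variance` at any time without rescanning past data.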
PROTEUS provides innovative contributions in several areas related to big data:
1) In scalable hybrid architectures, PROTEUS researches processing models that support data analysis for long-running batch processing, focusing on stability and fault tolerance, combined with a streaming model aiming for low latency and high throughput. Integrating these within Apache Flink enables not only statistical calculations but also advanced algorithms that support updating Machine Learning models. This is accompanied by a novel declarative language for expressing analytical queries for batch, online or hybrid processing, and by automatic optimization techniques for distributed/parallel tasks to optimize the actual execution of the program.
2) In scalable online machine learning, PROTEUS provides new techniques for ensuring scalability of basic stream sketches (moments, heavy hitters, event detection, subsampling, semiring statistics), but also advanced online learning (clustering, classification, prediction and detection), adapting different strategies including extreme learning, online deep learning and prequential learning. Features are integrated as a user-friendly library named SOLMA (Scalable OnLine Machine learning Algorithms).
3) In real-time interactive visualization, PROTEUS provides disruptive data visualization guidelines for big data taking into account visualization of in-motion/large/unstructured data. A software architecture for real-time interactive visualization was designed, and a novel visualization library for big data built, named Proteic.js (https://proteic.js.org/).
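The "heavy hitters" sketch named in item 2 above can be illustrated with the Misra-Gries summary, a classic bounded-memory algorithm for finding frequent items in a stream. This is a single-machine sketch under stated assumptions, not SOLMA's API or its distributed implementation; the function name `misra_gries` and the parameter `k` are chosen for this example.

```python
from typing import Dict, Hashable, Iterable


def misra_gries(stream: Iterable[Hashable], k: int) -> Dict[Hashable, int]:
    """Return candidate heavy hitters using at most k - 1 counters.

    Hypothetical helper for illustration. Any item occurring more than
    n / k times in a stream of length n is guaranteed to be among the
    returned keys; the stored counts are lower bounds on true counts.
    """
    if k < 2:
        raise ValueError("k must be at least 2")
    counters: Dict[Hashable, int] = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # No free counter: decrement all, dropping those at zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The returned keys are candidates only; a second pass, or an error bound on the counts, is needed to confirm which of them are truly frequent.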
Through these contributions, the impact of PROTEUS is manifold. From an EU strategic perspective, it reduces the gap with, and dependency on, US technology, empowering the EU Big Data industry. From an economic perspective, it fosters the development of new skills and the creation of new jobs and opportunities, contributing to economic growth. From an industrial perspective, it demonstrates in a real industrial scenario (steelmaking) how the technological advances can be applied to solve industrial needs, while also showing the flexibility and scalability required for application in other domains.