CORDIS - EU research results

Interactive Extreme-Scale Analytics and Forecasting

Periodic Reporting for period 2 - INFORE (Interactive Extreme-Scale Analytics and Forecasting)

Reporting period: 2020-07-01 to 2022-03-31

Industrial and scientific institutions increasingly need to deal with massive data flows streaming in from a multitude of sources. Processing, analyzing, and learning from such data often requires immense computational power on, and processing across, several Big Data platforms and/or High-Performance Computing (HPC) infrastructures. Also required are sophisticated analytics tools capable of extracting insights on the fly from a multitude of voluminous, correlated, high-velocity data streams. Such tools would allow a data analyst to process the data, and the insights extracted from it, interactively, with very fast response times for the desired analytics tasks. Proactive decision-making additionally requires predictive analytics tools that can forecast future events of interest. The resulting analysis should be accessible to all data analysts, who often lack the programming skills needed to code, optimize and debug data processing operations over Big Data.

At INFORE, we addressed all these challenges through several ambitious objectives. We first designed novel data summarization and approximate query processing techniques, as well as real-time, interactive machine learning and data mining tools, supporting the interactive construction of highly accurate models from extreme-scale data streams and massive data volumes. We also developed novel distributed complex event forecasting techniques, allowing not only for the timely detection of critical events as they occur, but also for the forecasting of their future occurrences. INFORE allows users to easily compose data analytics pipelines through its flexible, pluggable and extendable architecture, supported by corresponding software stacks. This architecture allows non-programmer data analysts to specify processing workflows and data analytics tasks, often with no coding required. The framework consists of a family of data processing operators that can be graphically interconnected to compose complex data processing tasks, while an optimizer module guides all optimization and runtime adaptation decisions. The INFORE approach was subjected to rigorous testing and evaluation, involving controlled experiments and reviews by domain experts, with real-life data from the financial, maritime and life sciences domains, culminating in the INFORE22 sea trial.

INFORE reached its ambitious objectives. In particular:
● We designed and developed an extensible and highly scalable Synopsis Data Engine (SDE), capable of summarizing massive, high velocity data arriving at dispersed locations/clusters, while also providing very fast response times to user queries. Our SDE has been designed to enable enhanced horizontal, vertical and federated scalability.
● We designed and developed the Online Machine Learning and Data Mining (OMLDM) component of the INFORE platform. This component implements the state-of-the-art in distributed, online ML adopting a parameter server paradigm for incremental training of models, while at the very same time, previously extracted models are deployed for analysis and inference purposes. We then also designed and implemented an abstract actor-like middleware, which shields the computational code from the underlying computational fabric and ported the OMLDM Parameter Server pipeline completely on the new middleware. Leveraging the new middleware, the state-of-the-art DeepLearning4J library can be executed in distributed OML mode within OMLDM, making OMLDM the most versatile OML framework currently in use.
● We developed a new framework for Complex Event Forecasting (CEF), based on the use of symbolic automata and a variable-order Markov model. Our CEF framework is capable of capturing long-term dependencies in a stream and outperforms previous state-of-the-art CEF solutions on accuracy. We developed a distributed version of the CEF module, built on top of Apache Flink, which also has the ability to perform learning in an online manner. We have also been able to automatically fine-tune our models’ hyper-parameters, through the use of Bayesian optimization.
● We designed the INFORE Architecture in a way that enables it to approach maximum flexibility, pluggability, and extensibility. The designed workflow can make use of any algorithm implemented within the components of the INFORE architecture, as well as custom, user-defined operators, with potentially available implementations on a variety of platforms.
● We designed and implemented a visual approach for creating data stream processing workflows and data analytics tasks that enables non-programmer data analysts to create data streaming workflows visually, with zero coding required. Our developed INFORE optimizer optimizes the execution of this workflow, which is then converted into streaming jobs specific to the available distributed Big Data platforms and clusters used. Moreover, the running workflows are subject to continuous optimization, with runtime adaptation having been implemented.
● We performed a successful final evaluation of our system with several expert users from the life sciences, financial and maritime domains. In addition, for the financial use case, a total of 284 questionnaires were sent out to target users, of whom 49 participated. The highlight of our testing and evaluation was the final INFORE22 trial, which tested our system in real-world conditions.
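
To make the parameter server paradigm mentioned for the OMLDM component more concrete, the following is a minimal, single-process sketch in Python. It is an illustration only and not the OMLDM implementation or API: the class names `ParameterServer` and `Worker`, the linear model, the learning rate and the synthetic stream are all assumptions chosen for brevity. Workers pull the current weights, compute a gradient from one streamed example, and push it back, while the same server object can answer prediction requests at any time.

```python
class ParameterServer:
    """Hypothetical server holding the shared weights of a linear model."""

    def __init__(self, dim, lr=0.1):
        self.weights = [0.0] * dim
        self.lr = lr

    def pull(self):
        # Workers fetch a copy of the current weights.
        return list(self.weights)

    def push(self, gradient):
        # Each gradient is applied as soon as it arrives (asynchronous style).
        self.weights = [w - self.lr * g for w, g in zip(self.weights, gradient)]

    def predict(self, x):
        # The model being trained can serve inference requests at any time.
        return sum(w * xi for w, xi in zip(self.weights, x))


class Worker:
    """Hypothetical learner task processing one stream partition."""

    def __init__(self, server):
        self.server = server

    def process(self, x, y):
        w = self.server.pull()
        error = sum(wi * xi for wi, xi in zip(w, x)) - y
        # Gradient of the squared error 0.5 * error**2 with respect to the weights.
        self.server.push([error * xi for xi in x])


def run_demo(rounds=5000):
    server = ParameterServer(dim=2)
    workers = [Worker(server), Worker(server)]
    # Synthetic stream (an assumption): y = 3*x + 1, with features [x, 1.0].
    for i in range(rounds):
        x = (i % 10) / 10.0
        workers[i % 2].process([x, 1.0], 3.0 * x + 1.0)
    return server


if __name__ == "__main__":
    trained = run_demo()
    print(trained.pull(), trained.predict([0.5, 1.0]))
```

In a distributed deployment the `pull` and `push` calls would be network messages between tasks running on a stream processing platform; here they are plain method calls so that the control flow stays visible.
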

The INFORE components developed made substantial contributions compared to the state-of-the-art. In particular:
● Our SDE supports a large number of synopses, including all commonly used ones, and implements a Synopsis-as-a-Service (SDEaaS) paradigm. Our novel SDEaaS approach allows one constantly running SDE job to: (i) accept on-the-fly requests for maintaining new synopses, (ii) dynamically enhance its functionality by plugging in new, external synopsis definitions, customizing the SDE to the needs of the application field at runtime with zero downtime, and (iii) reuse each available synopsis within multiple, concurrent application workflows, instead of duplicating the respective data streams and synopses for each workflow. To our knowledge, our approach is the first to support such a large number of data synopses, to serve all discussed types of scalability, and to combine such SDEaaS facilities.
● Our designed OMLDM component enables efficient federated learning and supports both synchronous and asynchronous communication protocols among the distributed learner tasks.
● Our designed CEF framework can capture long-term dependencies in a stream, and achieves better accuracy scores compared to previous, state-of-the-art CEF solutions.
● The INFORE optimizer supports optimization over several available computing platforms and provides capabilities for cost estimation over arbitrary operators through a black-box approach, along with runtime adaptation.
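
The three SDEaaS capabilities listed above (on-the-fly synopsis requests, pluggable synopsis definitions, and reuse across workflows) can be illustrated with a small Python sketch. This is not the INFORE SDE code or its interface: `SynopsisService`, its method names, and the choice of a Count-Min sketch as the example synopsis are assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Approximate frequency counter; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        # One hash function per row, derived by salting with the row number.
        digest = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(digest[:8], "big") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.rows[row][self._index(row, item)] += count

    def estimate(self, item):
        # The minimum over rows is the least collision-inflated counter.
        return min(self.rows[row][self._index(row, item)]
                   for row in range(self.depth))


class SynopsisService:
    """Hypothetical long-running service that maintains named synopses."""

    def __init__(self):
        self.factories = {"countmin": CountMinSketch}
        self.synopses = {}

    def register_type(self, name, factory):
        # (ii) Plug in a new synopsis definition while the service keeps running.
        self.factories[name] = factory

    def request(self, synopsis_id, type_name, **params):
        # (i) Create a synopsis on request; (iii) later requests with the same
        # id reuse the existing instance instead of building a duplicate.
        if synopsis_id not in self.synopses:
            self.synopses[synopsis_id] = self.factories[type_name](**params)
        return self.synopses[synopsis_id]

    def ingest(self, item):
        # Simplification: every synopsis sees every item; a real engine
        # would route items by stream.
        for synopsis in self.synopses.values():
            synopsis.add(item)


if __name__ == "__main__":
    service = SynopsisService()
    workflow_a = service.request("ticker-counts", "countmin", width=64, depth=3)
    workflow_b = service.request("ticker-counts", "countmin")
    for ticker in ["AAA", "BBB", "AAA", "AAA"]:
        service.ingest(ticker)
    print(workflow_a is workflow_b, workflow_b.estimate("AAA"))
```

Any object exposing `add` and `estimate` can be registered as a new synopsis type in this sketch, which mirrors, in a very reduced form, the idea of extending a running engine without restarting it.
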

We expect that the outcome of INFORE, a fully integrated prototype that serves the needs of data analysts, will help boost their productivity and may increase efficiency and effectiveness in several application domains in the EU (e.g. market feed or stock processing, maritime monitoring, biological process monitoring, electronic trading, network and infrastructure monitoring, fraud detection in telecommunications and finance, and command and control in dynamic environments).