Interactive Extreme-Scale Analytics and Forecasting

Periodic Reporting for period 1 - INFORE (Interactive Extreme-Scale Analytics and Forecasting)

Reporting period: 2019-01-01 to 2020-06-30

Industrial and scientific institutions increasingly need to deal with massive data flows streaming in from a multitude of sources. Processing, analyzing, and learning from such data often requires immense computational power on, and processing across, several Big Data platforms and/or High-Performance Computing (HPC) infrastructures. Also required are sophisticated analytics tools capable of extracting insights on the fly from a multitude of voluminous, correlated, high-velocity data streams, while also harvesting ever-growing historical data repositories. Such tools would allow a data analyst to process data, and the insights extracted from it, interactively, with very fast response times for the desired analytics tasks. To allow for proactive decision-making, predictive analytics tools that can forecast future events of interest are also required. The ability to forecast, as early as possible, a good approximation of the outcome of a time-consuming and resource-demanding computational task makes it possible to quickly identify undesired outcomes and to save valuable time, effort and computational resources that would otherwise be spent in vain. Consider, for example, the ability to forecast the outcome of a complex multi-cellular system simulation of tumor evolution without waiting for the simulation to complete. All of this analysis should be made available to all data analysts, who often do not possess the programming skills needed to code, optimize and debug data processing operations over Big Data.

At INFORE, we address all these challenges through several ambitious objectives. To enable real-time responses and interactive analytics, we first aim to design novel data summarization and approximate query processing techniques, as well as real-time, interactive machine learning and data mining tools that support the interactive construction of highly accurate models from extreme-scale data streams and massive data volumes. We also aim to develop novel distributed complex event forecasting techniques, allowing not only the timely detection of critical events as they occur, but also the forecasting of their future occurrences. Since it is crucial to make the interactive data analytics library easy to use and to increase its effectiveness, e.g. by allowing users to easily compose data analytics pipelines from the existing tools in the library, a further goal of INFORE is to design a flexible, pluggable and extendable architecture, supported by corresponding software stacks. This architecture will be made possible through a carefully constructed framework that lets non-programmer data analysts specify processing workflows and data analytics tasks, often with no coding required. The framework will consist of a family of data processing operators that can be graphically interconnected to form complex data processing tasks with no programming required, while an optimizer module will guide all optimization decisions, such as the Big Data platform or HPC infrastructure on which each operator will be executed and the job execution parameters. The INFORE approach will be subject to rigorous testing and evaluation, involving controlled experiments and reviews by domain experts, with real-life data from the financial, maritime and life sciences domains.
During this period, we at INFORE have made substantial progress towards reaching our objectives:
● We designed and developed an extensible and highly scalable Synopsis Data Engine (SDE), capable of summarizing massive, high velocity data arriving at dispersed locations/clusters, while also providing very fast response times to user queries. Our SDE has been designed to enable enhanced horizontal, vertical and federated scalability.
● We designed and developed the Online Machine Learning and Data Mining (OMLDM) component of the INFORE platform. This component implements state-of-the-art distributed, online ML, adopting a parameter server paradigm for the incremental training of models while, at the same time, previously extracted models are deployed for analysis and inference.
● We developed a new framework for Complex Event Forecasting (CEF), based on the use of symbolic automata and a variable-order Markov model. Our CEF framework is capable of capturing long-term dependencies in a stream and outperforms previous state-of-the-art CEF solutions on accuracy. Furthermore, we demonstrated that the training/learning task of parameter estimation of the CEF module can be executed in an online and distributed manner.
● We designed the INFORE Architecture to be as flexible, pluggable, and extensible as possible. A designed workflow can use any algorithm implemented within the components of the INFORE architecture, as well as custom, user-defined operators, with implementations potentially available on a variety of Big Data platforms and/or High-Performance Computing (HPC) infrastructures.
● We designed and implemented a visual approach for creating data stream processing workflows and data analytics tasks, which enables non-programmer data analysts to build such workflows with zero coding required. The INFORE optimizer optimizes the execution of the workflow, which is then converted into streaming jobs specific to the available distributed Big Data platforms and clusters.
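To make the idea of a data synopsis from the first bullet concrete, the following is a minimal sketch of one widely used synopsis, a Count-Min sketch, written in Python. It is an illustration only and not the INFORE SDE or its API: the class name, parameters, and the stream keys are hypothetical. The sketch keeps approximate per-key counts in fixed memory, and two sketches built at different sites can be merged by cell-wise addition, which is the property that makes summaries of this kind suitable for data arriving at dispersed locations.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency summary; estimates never undercount."""

    def __init__(self, width=256, depth=4):
        self.width = width
        self.depth = depth
        self.rows = [[0] * width for _ in range(depth)]

    def _index(self, row, key):
        # One independent-looking hash per row, derived by salting with the row number.
        digest = hashlib.blake2b(
            key.encode("utf-8"), digest_size=8, salt=row.to_bytes(8, "little")
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, key, count=1):
        for r in range(self.depth):
            self.rows[r][self._index(r, key)] += count

    def estimate(self, key):
        # Collisions can only inflate a cell, so the minimum is the tightest bound.
        return min(self.rows[r][self._index(r, key)] for r in range(self.depth))

    def merge(self, other):
        """Combine two sketches with identical shape into a new one."""
        if (self.width, self.depth) != (other.width, other.depth):
            raise ValueError("sketches must have the same width and depth")
        merged = CountMinSketch(self.width, self.depth)
        for r in range(self.depth):
            merged.rows[r] = [a + b for a, b in zip(self.rows[r], other.rows[r])]
        return merged


# Hypothetical usage: two sites summarize their local streams independently.
site_a, site_b = CountMinSketch(), CountMinSketch()
for symbol in ["AAPL", "AAPL", "AAPL", "MSFT"]:
    site_a.add(symbol)
for symbol in ["AAPL", "AAPL"]:
    site_b.add(symbol)

combined = site_a.merge(site_b)
print(combined.estimate("AAPL"))  # at least 5, the true combined count
```

The estimate is an upper bound on the true count; a wider sketch lowers the chance of overestimation at the cost of more memory.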
The INFORE components developed so far already advance the state of the art. In particular:
● Our SDE supports a large number of synopses, including all commonly used ones, and implements a Synopsis-as-a-Service paradigm (termed SDEaaS). Our novel SDEaaS approach allows one constantly running SDE job to: (i) accept on-the-fly requests for maintaining new synopses, (ii) dynamically extend its functionality by plugging in external, new synopsis definitions, customizing the SDE to the needs of the application field at runtime with zero downtime, and (iii) reuse each available synopsis within multiple, concurrent application workflows, instead of duplicating the respective data streams and synopses for each workflow. To our knowledge, our approach is the first to support such a large number of data synopses, to provide all the types of scalability discussed above, and to combine such SDEaaS facilities.
● Our designed OMLDM component enables efficient federated learning and supports both synchronous and asynchronous communication protocols among the distributed learner tasks.
● Our CEF framework can capture long-term dependencies in a stream and achieves better accuracy than previous state-of-the-art CEF solutions.
● The INFORE optimizer supports optimization over several available computing platforms and provides capabilities for cost estimation over arbitrary operators through a black-box approach.
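The three SDEaaS capabilities listed in the first bullet above (on-the-fly synopsis requests, plugging in new synopsis definitions at runtime, and sharing one synopsis across workflows) can be sketched in a few lines of Python. This is a toy, single-process illustration under stated assumptions, not the INFORE implementation: `SynopsisService`, `RunningMean`, `RunningMax`, and the stream name `"prices"` are all hypothetical, and the two "synopses" are deliberately trivial stand-ins for real sketches.

```python
class RunningMean:
    """Toy synopsis: mean of all values seen so far."""

    def __init__(self):
        self.n = 0
        self.total = 0.0

    def add(self, value):
        self.n += 1
        self.total += value

    def query(self):
        return self.total / self.n if self.n else None


class RunningMax:
    """Toy synopsis: largest value seen so far."""

    def __init__(self):
        self.best = None

    def add(self, value):
        if self.best is None or value > self.best:
            self.best = value

    def query(self):
        return self.best


class SynopsisService:
    """One long-running service that maintains synopses on request."""

    def __init__(self):
        self._definitions = {}  # synopsis type name -> factory
        self._instances = {}    # (stream id, type name) -> live synopsis

    def register_definition(self, name, factory):
        # (ii) new synopsis types can be plugged in while the service runs.
        self._definitions[name] = factory

    def request(self, stream_id, name):
        # (i) start maintaining a synopsis on demand;
        # (iii) later requests for the same pair reuse the same instance.
        key = (stream_id, name)
        if key not in self._instances:
            self._instances[key] = self._definitions[name]()
        return self._instances[key]

    def ingest(self, stream_id, value):
        # Each incoming tuple updates every synopsis maintained on its stream.
        for (sid, _), synopsis in self._instances.items():
            if sid == stream_id:
                synopsis.add(value)


service = SynopsisService()
service.register_definition("mean", RunningMean)
mean_for_workflow_1 = service.request("prices", "mean")
service.ingest("prices", 10)
service.ingest("prices", 20)

# A new definition is added at runtime, without restarting the service.
service.register_definition("max", RunningMax)
max_for_workflow_1 = service.request("prices", "max")
service.ingest("prices", 30)

# A second workflow asks for the same synopsis and shares the existing one.
mean_for_workflow_2 = service.request("prices", "mean")
print(mean_for_workflow_2 is mean_for_workflow_1)  # True
print(mean_for_workflow_2.query())                 # 20.0
print(max_for_workflow_1.query())                  # 30
```

Note that a synopsis requested late only reflects the data that arrived after the request, which is why the maximum above has seen only the last value.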

By the end of the project, the expected outcome is a fully integrated prototype that will serve the needs of data analysts, help boost their productivity, and potentially increase the efficiency and effectiveness of several application domains in the EU (e.g. market feed or stock processing, maritime monitoring, biological process monitoring, electronic trading, network and infrastructure monitoring, fraud detection in telecommunications and finance, and command and control in dynamic environments).