
ENABLING THE BIG DATA PIPELINE LIFECYCLE ON THE COMPUTING CONTINUUM

Periodic Reporting for period 1 - DataCloud (ENABLING THE BIG DATA PIPELINE LIFECYCLE ON THE COMPUTING CONTINUUM)

Reporting period: 2021-01-01 to 2021-12-31

Cloud computing has been a major disruptive technology of the last decade, providing resources-as-a-service for diverse Internet applications. While the Cloud offers elastic capacity and customisable connectivity over a large-scale network, the resilience, sustainability, and human-centric collaborative requirements of Big Data processing applications demand an interoperable end-to-end ecosystem design that pushes infrastructure services, traditionally bounded within data centres, towards remote nodes closer to the data sources. The Computing Continuum, which federates Cloud services with the emerging Edge and Fog computing paradigms, partially addresses these concerns by reducing the overhead of transferring distributed data into remote data centres. From the perspective of Big Data management on the Computing Continuum, significant challenges remain in supporting the processing of Big Data, including the effective discovery, modelling, and simulation of Big Data pipelines, and their efficient deployment over heterogeneous resources from different providers using trustworthy services.

DataCloud aims to create a novel paradigm for Big Data pipeline processing over heterogeneous resources encompassing the Cloud/Edge/Fog Computing Continuum, covering the complete lifecycle of managing Big Data pipelines. The project aims to make this paradigm easily accessible to a wide set of large and small organizations that encounter difficulties in capitalizing on Big Data due to a lack of suitable processing capabilities. From a technical perspective, the goal is to develop a software toolbox, the DataCloud toolbox, comprising new languages, methods, infrastructures, and software prototypes for discovering, simulating, deploying, and adapting Big Data pipelines on heterogeneous and untrusted resources in a manner that makes the execution of Big Data pipelines traceable, trustable, manageable, analyzable, and optimizable.
The focus during the first period of DataCloud was on creating the initial implementation of the DataCloud toolbox through a structured requirements-gathering process based on an interview-based analysis of the project Business Cases. Through this process, we identified the key stakeholders of each organization, how they map to the main roles in the Big Data pipeline lifecycle (i.e. Business Domain Expert, Data Scientist, Computing Continuum Infrastructure Maintainer), their goals, and the perceived benefits of using each tool. The identified requirements informed the design and implementation of the features of the first version of the DataCloud toolbox. In addition to functional features, the project has developed a detailed architecture of the toolbox and has identified (and in some cases implemented) the APIs that will serve as integration points between the individual tools. A first version of the domain-specific language (DSL) for Big Data pipelines on the Computing Continuum has been developed and deployed to enable pipeline definition and technical integration across the DataCloud toolbox. Furthermore, the project has developed initial designs and, in several cases, implementations of the tools for Big Data pipeline discovery, definition, simulation, resource provisioning, deployment, and runtime adaptation.
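The project deliverables define the actual DSL; purely as a rough illustration, the sketch below (hypothetical Python classes and field names, not the DataCloud DSL syntax) shows the kind of information a Continuum pipeline definition has to capture: containerised steps, the data flow between them, and resource or placement hints.

```python
# Rough illustration only: these class and field names are hypothetical
# and are NOT the DataCloud DSL syntax. The sketch models containerised
# pipeline steps, data flow between them, and placement hints for the
# Cloud/Edge/Fog Computing Continuum.
from dataclasses import dataclass, field

@dataclass
class Step:
    name: str
    image: str                                        # container template implementing the step
    requirements: dict = field(default_factory=dict)  # e.g. {"cpu": 2, "layer": "edge"}

@dataclass
class Pipeline:
    name: str
    steps: list                                       # ordered Step objects
    dataflow: list                                    # (producer, consumer) edges between steps

pipeline = Pipeline(
    name="sensor-analytics",
    steps=[
        Step("ingest", "registry/ingest:1.0", {"layer": "edge",  "cpu": 1}),
        Step("clean",  "registry/clean:1.0",  {"layer": "fog",   "cpu": 2}),
        Step("train",  "registry/train:1.0",  {"layer": "cloud", "gpu": 1}),
    ],
    dataflow=[("ingest", "clean"), ("clean", "train")],
)
```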

Apart from the technical results, the project has advanced the design of five Business Cases to demonstrate the use of the DataCloud toolbox. The project has delivered a detailed specification technical specification of the Business Cases (including the market requirements) and details about how the toolbox will be used to implement them. Furthermore, the project has produced a detailed definition of the Big Data pipelines that will be developed for each business case and has identified and started the implementation of the core components and services that are involved. The five Business Cases include: 1) Smart Mobile Marketing Campaigns (SMARK), 2) Automatic Live Sports Content Annotation (MOGSPORTS), 3) Digital Health System (TLUHEALTH), 4) Products Development in Ceramic Engineering (P-DICE), and 5) Analytics of Manufacturing Assets (AMANS).
DataCloud is advancing the state of the art in areas relevant for managing Big Data pipelines on the Computing Continuum:
1) Pipeline Discovery: algorithms to extract well-formed event logs from heterogeneous sources, techniques to learn the processes underlying data pipelines, and a scalable AI-based approach to pipeline-driven Big Data analytics.
2) Pipeline Definition: a graphical compositional DSL to define Big Data pipelines, libraries of meta-pipelines, and extensible, pluggable container templates for pipeline steps.
3) Pipeline Simulation: an approach for simulating pipelines to find the most cost-effective distribution of compute resources (see the sketch after this list).
4) Blockchain-based resource provisioning: oracles capable of assessing, in a decentralized manner, the properties of the hardware on which pipelines will be executed, and improved off-chain computing protocols supporting interdependent and stateful pipeline steps.
5) Pipeline Deployment: an orchestration mechanism with auto-scaling features to deploy and execute pipelines and adapt them to data drifts or exogenous events.
6) Pipeline Adaptation: a user- and provider-centred approach for improved and adaptive pipeline deployment, utilizing resources across various control domains.
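As a rough illustration of the simulation idea in item 3, the sketch below enumerates feasible placements of pipeline steps onto Continuum tiers and selects the cheapest one. All step demands, tier prices, and placement constraints here are assumptions made up for illustration, not the project's actual cost model.

```python
# Minimal sketch of cost-driven placement simulation. The demands, prices,
# and constraints are assumed values, not DataCloud's actual model.
from itertools import product

demand = {"ingest": 1.0, "clean": 2.5, "train": 8.0}   # CPU-hours per step (assumed)
price = {"edge": 0.02, "fog": 0.05, "cloud": 0.10}     # price per CPU-hour (assumed)
allowed = {                                            # assumed placement constraints
    "ingest": {"edge", "fog"},                         # ingestion stays near the data source
    "clean": {"fog", "cloud"},
    "train": {"cloud"},                                # training needs data-centre hardware
}

def cost(placement):
    """Total price of running each step on its assigned tier."""
    return sum(demand[step] * price[tier] for step, tier in placement.items())

# Enumerate every assignment of steps to tiers, keep the feasible ones,
# and select the cheapest placement.
candidates = (dict(zip(demand, combo)) for combo in product(price, repeat=len(demand)))
feasible = (p for p in candidates if all(t in allowed[s] for s, t in p.items()))
best = min(feasible, key=cost)
print(best, round(cost(best), 3))  # {'ingest': 'edge', 'clean': 'fog', 'train': 'cloud'} 0.945
```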

Expected results by the end of the project include six tools: DIS-PIPE (discovery of the structure of Big Data pipelines from data sources); DEF-PIPE (textual and graphical description of Big Data pipelines); SIM-PIPE (simulation of container-based Big Data pipelines); R-MARKET (provisioning of resources from the Computing Continuum); DEP-PIPE (adaptive, secure, and scalable orchestration of data pipelines); and ADA-PIPE (pipeline scheduling and adaptation). These tools will be combined in the DataCloud toolbox, which will be validated in five business products and services: SMARK (smart data pipeline implementation for managing mobile digital marketing campaigns); MOGSPORTS (platform for automatic metadata enrichment of live sports events); P-DICE (framework for discovering production data pipelines in the sanitary-ware industry); TLUHEALTH (Telecare/Telehealth services provided as SaaS); and AMANS (analytics of manufacturing assets).
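To make the intended tool chain concrete, here is a hedged end-to-end sketch in which simple stubs stand in for the six tools. Every function name, signature, and return value is hypothetical and merely mirrors the lifecycle order described above; these are not the project's actual APIs.

```python
# Hypothetical stubs only: names, signatures, and return values are
# illustrative, not the DataCloud tool APIs. The point is the hand-off
# order across the Big Data pipeline lifecycle.

def discover(event_logs):         # DIS-PIPE: mine pipeline structure from event logs
    return ["ingest", "clean", "train"]

def define(steps):                # DEF-PIPE: express the steps in the pipeline DSL
    return {"name": "demo", "steps": steps}

def simulate(pipeline):           # SIM-PIPE: dry-run to estimate per-step demand
    return {step: 1.0 for step in pipeline["steps"]}

def provision(demands):           # R-MARKET: acquire Continuum resources per step
    return {step: "edge" for step in demands}

def deploy(pipeline, resources):  # DEP-PIPE: orchestrate the pipeline containers
    return {"pipeline": pipeline["name"], "placement": resources, "status": "running"}

def adapt(deployment):            # ADA-PIPE: monitor and reschedule at runtime
    return deployment

pipeline = define(discover(["events.log"]))
deployment = deploy(pipeline, provision(simulate(pipeline)))
print(adapt(deployment))
```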

The potential impact targets various groups:
- Data/ICT industries: time and cost saved in using Big Data pipelines, easier Big Data pipeline lifecycle management for the relevant stakeholders, and optimized Big Data processing.
- Data scientists: seamless use of the Computing Continuum infrastructure for deploying Big Data pipelines.
- Business experts: the possibility to get involved in the definition, simulation, and deployment of Big Data pipelines.
- DevOps/DataOps: increased productivity and quality of system deployment and maintenance.
- Resource providers: novel ways to monetize their resources in a resource marketplace.
- Policy makers: more effective decision-making procedures based on cross-sectorial Big Data and heterogeneous infrastructures.
- Entrepreneurs: increased business opportunities related to innovative services and apps.
- Society at large: advancing research and applying innovative technologies that take the best of breed from the Big Data and Computing Continuum domains.