ENABLING THE BIG DATA PIPELINE LIFECYCLE ON THE COMPUTING CONTINUUM

Project information

DataCloud

Grant agreement ID: 101016835

DOI

10.3030/101016835

Closed project

EC signature date 16 December 2020

Start date 1 January 2021

End date 31 December 2023

Funded under

INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT)

Total cost

€ 4 999 996,25

EU contribution

€ 4 999 996,25


Coordinated by

SINTEF AS
Norway

Periodic Reporting for period 2 - DataCloud (ENABLING THE BIG DATA PIPELINE LIFECYCLE ON THE COMPUTING CONTINUUM)

Reporting period: 2022-01-01 to 2023-12-31

Cloud computing has been a major disruptive technology over the last decade, providing resources as a service for diverse Internet applications. While the Cloud offers elastic capacity and customisable connectivity over a large-scale network, the resilience, sustainability and human-centric collaborative requirements of Big Data processing applications demand an interoperable end-to-end ecosystem design that pushes infrastructure services, traditionally confined to data centres, towards remote nodes closer to the data sources. The Computing Continuum, which federates Cloud services with the emerging Edge and Fog computing paradigms, partially addresses these concerns, but significant challenges remain in supporting Big Data processing and management on the Computing Continuum.

DataCloud created a novel paradigm for Big Data pipeline processing over heterogeneous resources encompassing the Cloud/Edge/Fog Computing Continuum, covering the complete lifecycle of managing Big Data pipelines. DataCloud makes this paradigm easily accessible to a wide range of large and small organizations that encounter difficulties in capitalizing on Big Data. From a technical perspective, the goal is to develop a software toolbox comprising new languages, methods, infrastructures, and software prototypes for discovering, simulating, deploying, and adapting Big Data pipelines on heterogeneous and untrusted resources, in a manner that makes the execution of Big Data pipelines traceable, trustable, manageable, analyzable, and optimizable.

At the end of the project, DataCloud delivered the final implementation of the DataCloud toolbox, which covers both design-time and run-time aspects of Big Data pipeline deployment. The toolbox features were developed considering requirements from five diverse Business Cases, a study of the state of the art, validation feedback from project partners and external users, and the project review of the first reporting period. The tools are designed to be used either as an integrated set of components through the DataCloud integrated UI, or separately, as stand-alone tools supporting well-established standards and technologies. The toolbox comprises six tools: DIS-PIPE for pipeline discovery and conformance checking; DEF-PIPE for data pipeline specification and parametrization; SIM-PIPE for pipeline testing, validation, simulation and configuration before deployment; R-MARKET for resource provisioning of trusted and untrusted resources; DEP-PIPE for Big Data pipeline deployment; and ADA-PIPE for pipeline scheduling and run-time adaptation. In addition, the project delivered a domain-specific language that extends traditional data workflow specification standards with the additional features needed for both design-time and run-time support of Big Data pipelines.

To demonstrate the usability and usefulness of the toolbox, the project implemented and deployed five new Business Cases that make use of all DataCloud tools. The pipelines cover a variety of tasks in digital marketing, live media streaming, electronic healthcare, manufacturing and Industry 4.0. Each Business Case specified, implemented and deployed one or more Big Data pipelines that were incorporated into the partners' heterogeneous technical infrastructures to produce business value. The pipelines were implemented through a collaboration between domain experts, data engineers and DataOps specialists, demonstrating the ability of the toolbox to support a wide range of stakeholders. Specifically, SMARK developed and implemented a data pipeline for digital marketing, validated tools for data exploitation, and disseminated results via social media and events, focusing on internal usage for marketing campaigns. MOGSPORTS fully integrated its sports analytics tools with DataCloud, validated them through focus groups and pilots at football matches, and outlined an exploitation plan as part of its communication and dissemination efforts. TLUHEALTH advanced remote patient monitoring, validated DataCloud tools through pilots with real customers that led to commercial contracts, and contributed to scientific publications on data pipelines for patient monitoring. P-DICE improved manufacturing production planning through process mining and cloud computing, with validation by stakeholders including plant and production managers, and shared results through dissemination activities. AMANS completed toolkit deployment for welding processes, validated it through internal and market assessments, and contributed significantly to scientific conferences and publications, highlighting data science solutions in manufacturing.

DataCloud's engagement with the wider community was carried out through a number of channels, including participation in physical events, online presence (video presentations and interviews, blog posts, news articles, press releases, social media posts, etc.), project collaborations and participation in industrial organizations. The project promoted its results to the industrial community through participation in four industrial organizations. DataCloud also collaborated extensively with nine H2020 and HEU projects by advertising project results, incorporating technical concepts related to Big Data pipelines, and integrating tools within project technical architectures. As for the scientific community, the project results, including those of the project's business cases, led to more than 60 scientific publications.

With the DataCloud toolbox, the project advances the state of the art in managing Big Data pipelines on the Computing Continuum. DIS-PIPE introduced AI-based automated planning for data pipeline model discovery and alignment, enhanced with customized process mining, scalable trace alignment, advanced visualization techniques, and log filtering features, significantly advancing the state of the art in data pipeline management and analysis. DEF-PIPE improved pipeline development efficiency through a visual tool that enables user-shared repositories, conditional and iterative pipeline descriptions, and YAML workflow generation, significantly reducing learning and development time compared to existing orchestration tools. SIM-PIPE introduced a novel approach for dry running Big Data pipelines in a secure, controlled environment, featuring an enhanced GUI, a RESTful API, and integration with Kubernetes, Argo Workflows, and Grafana for improved interoperability and user experience. With R-MARKET, the project refined the components of a blockchain-based resource marketplace, optimizing resource allocation processes and enabling structured transactions through smart contracts, a significant step towards a decentralized marketplace for edge-cloud resources. The implementation of ADA-PIPE and DEP-PIPE for the deployment and management of data pipelines across the Cloud/Fog/Edge Computing Continuum showcases a leap in the automatic adaptation of data pipelines, ensuring seamless deployment across varied computing resources. Finally, the second version of ADA-PIPE's adaptation and scheduling algorithm introduced advanced monitoring and prediction capabilities for resource requirements and malfunctioning devices, facilitating efficient pipeline adaptation and scheduling through a network-accessible service.
