Extreme Near-Data Processing Platform

Project information

NEARDATA

Grant agreement ID: 101092644

DOI

10.3030/101092644

Closed project

EC signature date: 16 November 2022

Start date: 1 January 2023

End date: 31 December 2025

Funded under

Digital, Industry and Space

Total cost

€ 3 913 585.00

EU contribution

€ 3 913 585.00

Coordinated by

UNIVERSITAT ROVIRA I VIRGILI
Spain

Periodic Reporting for period 1 - NEARDATA (Extreme Near-Data Processing Platform)

Reporting period: 2023-01-01 to 2024-06-30

According to multiple analyst estimates, around 80% to 90% of data is unstructured (text, images, audio, video), meaning that it is not stored in a searchable format such as a database. As a consequence, huge pools of unstructured data are locked away in object stores as bulk data that is very difficult to mine and analyse. Unstructured data management (discovery, collection, mining, filtering and processing) remains a nightmare for data scientists, who have to cope with a proliferation of data sources from heterogeneous domains. Each of these domains has its own domain-specific data manipulation mechanisms, data pipelines and governance models, with which data scientists must become acquainted.

The NEARDATA project focuses on three health data domains containing large volumes of unstructured data: metabolomics (images), genomics (text), and surgery data (video). The selected data domains fit the definition of extreme data well. First, omics settings face huge and growing data volumes that push current technologies to their limits. The main challenge here is data ingestion from Object Storage to computing analytics services. Second, computer-assisted surgery poses an extreme data-speed problem, because it requires low-latency, real-time analytics over video and robotic IoT event streams. The challenge here is to support real-time video streams that must be processed and moved to Object Storage very quickly and at large scale. Finally, health data in general is highly sensitive, so it has tough privacy and security requirements that prevent many hospitals and research labs from sharing, moving or publishing their data openly. The big challenge here is to discover and distil meaningful, reliable and useful data from heterogeneous and dispersed or scarce sources in a trustworthy way, under stringent confidentiality requirements.

This project is extremely ambitious in terms of its expected impact. First of all, it aims to strengthen European leadership in the global data economy through innovative data technologies. In particular, the project aims to enable International Data Spaces in three health domains (metabolomics, genomics, surgomics), which will boost innovation around today's complex unstructured data formats in those domains. The project already shows international leadership in metabolomics data, and it is creating innovative technologies for genomics and surgomics.

During the first half of the project we focused on research and development for the three challenges in which the NEARDATA platform aims to go beyond the state of the art.

We developed a novel intermediary data service (XtremeDataHub) between Object Storage and analytics platforms. This Data Access Layer provides serverless, type-aware data connectors that optimise data management operations (partitioning, filtering, transformation, aggregation) and interactive queries (search, discovery, matching, multi-object queries) to present data efficiently to analytics platforms. Our data connectors enable an elastic, data-driven process-then-compute paradigm that significantly reduces traffic on the data interconnect, ultimately resulting in higher overall data throughput.
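The process-then-compute idea can be illustrated with a minimal sketch. It is not the XtremeDataHub or Dataplug API: the functions `partition_by_lines` and `filter_near_data` are hypothetical names, and an in-memory byte string stands in for an object held in Object Storage. The sketch shows a type-aware partitioner that cuts a newline-delimited text object on record boundaries, followed by a filter applied next to the data so that only matching records are shipped to the analytics side.

```python
from typing import Callable, List, Tuple


def partition_by_lines(blob: bytes, n_parts: int) -> List[Tuple[int, int]]:
    """Split a newline-delimited object into byte ranges ending on record boundaries."""
    if n_parts <= 0:
        raise ValueError("n_parts must be positive")
    size = len(blob)
    target = max(1, size // n_parts)  # approximate bytes per partition
    ranges = []
    start = 0
    while start < size:
        end = min(start + target, size)
        # Extend the cut to the next newline so that no record is split in two.
        newline = blob.find(b"\n", end - 1)
        end = size if newline == -1 else newline + 1
        ranges.append((start, end))
        start = end
    return ranges


def filter_near_data(blob: bytes, byte_range: Tuple[int, int],
                     keep: Callable[[bytes], bool]) -> List[bytes]:
    """Read one partition and return only the records the caller asked for."""
    start, end = byte_range
    return [line for line in blob[start:end].splitlines() if keep(line)]


# Hypothetical tab-separated genomics-like records (illustrative data only).
records = b"chr1\t100\tA\nchr2\t200\tC\nchr1\t300\tG\nchr3\t400\tT\n"
ranges = partition_by_lines(records, n_parts=2)
shipped = [
    filter_near_data(records, r, lambda line: line.startswith(b"chr1"))
    for r in ranges
]
print(ranges)
print(shipped)
```

In a real deployment each byte range would be fetched with a ranged read by a separate serverless worker; the point of the sketch is only that partitioning and filtering happen before the data crosses the interconnect.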

We seamlessly combined streaming and batch data processing for analytics. We developed stream data connectors, deployed as stream operators, that provide very fast stateful computations over low-latency event and video streams.
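As an illustration of what a stateful stream operator does, the following minimal sketch keeps per-key running state over a tumbling time window and emits an aggregate each time a window closes. It is plain Python and does not use the Pravega API; the event layout, the `tumbling_mean` name and the sample values are assumptions made for this example, and events are assumed to arrive in timestamp order.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Assumed event layout: (timestamp in seconds, source key, numeric value).
Event = Tuple[float, str, float]


def tumbling_mean(events: Iterable[Event],
                  window_s: float) -> Iterator[Tuple[int, Dict[str, float]]]:
    """Yield (window index, per-key mean) whenever a window closes."""
    current = None
    sums: Dict[str, float] = defaultdict(float)   # operator state: running sums
    counts: Dict[str, int] = defaultdict(int)     # operator state: event counts
    for ts, key, value in events:
        window = int(ts // window_s)
        if current is not None and window != current:
            # The previous window is complete: emit its result and reset state.
            yield current, {k: sums[k] / counts[k] for k in sums}
            sums.clear()
            counts.clear()
        current = window
        sums[key] += value
        counts[key] += 1
    if current is not None:
        # Flush the last, possibly partial, window at end of stream.
        yield current, {k: sums[k] / counts[k] for k in sums}


events = [(0.1, "arm", 1.0), (0.6, "arm", 3.0), (0.9, "cam", 10.0), (1.2, "arm", 5.0)]
results = list(tumbling_mean(events, window_s=1.0))
print(results)  # [(0, {'arm': 2.0, 'cam': 10.0}), (1, {'arm': 5.0})]
```

The same generator works unchanged on a finite batch or on an unbounded iterator, which is one simple way to treat batch data as a bounded stream.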

We created a Data Broker service enabling trustworthy data sharing and confidential orchestration of data pipelines across the Compute Continuum. To ensure confidentiality and integrity, we developed mechanisms that use Trusted Execution Environments (TEEs) together with federated learning architectures. To protect data in flight and at rest, we implemented transparent encryption for both data transfers and storage, which requires no code modification and still provides high throughput.
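Federated learning lets sites such as hospitals train on local data and share only model updates. The report does not say which aggregation rule the project uses; the sketch below shows sample-weighted federated averaging, a common choice, purely as an example. The function name `federated_average` and the numbers are hypothetical, and in a confidential setup this aggregation step is the kind of code that would run inside a TEE.

```python
from typing import List, Sequence, Tuple


def federated_average(updates: Sequence[Tuple[Sequence[float], int]]) -> List[float]:
    """Combine per-site model weights, weighting each site by its sample count.

    updates: one (weights, n_samples) pair per participating site.
    """
    if not updates:
        raise ValueError("at least one update is required")
    total = sum(n for _, n in updates)
    if total <= 0:
        raise ValueError("total number of samples must be positive")
    dim = len(updates[0][0])
    averaged = [0.0] * dim
    for weights, n in updates:
        if len(weights) != dim:
            raise ValueError("all sites must send weight vectors of equal length")
        for i, w in enumerate(weights):
            averaged[i] += w * n / total
    return averaged


# Two hypothetical sites: only weights and sample counts leave each site,
# never the underlying patient data.
site_a = ([1.0, 2.0], 100)
site_b = ([3.0, 6.0], 300)
global_weights = federated_average([site_a, site_b])
print(global_weights)  # [2.5, 5.0]
```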

The project shows good progress in the four core technologies (Lithops, Pravega, SCONE, Metaspace) and we developed a new data management toolkit (Dataplug) for unstructured data. The tool was integrated and tested with major data platforms like Dask, Lithops, and Ray. The core technologies have been successfully demonstrated in the five use cases.

In summary, the project is progressing at a good pace on both the technical (engineering) side and the scientific side. With our data management technologies, such as Metaspace, the Data Broker and federated learning, we are facilitating data sharing in these domains. With our communication technologies (Cloud, Edge, HPC) we are simplifying data analysis in heterogeneous domains.

During the first part of the project, the NEARDATA platform progressed as planned. It delivered important advances for the scientific community by developing solutions to the open challenges beyond the state of the art, implemented new functionality in the software components of the NEARDATA platform, designed and developed dedicated software, and achieved a strong international impact on the scientific community.

We can identify up to 17 scientific publications produced during the NEARDATA project that address the open challenges of analysing extreme data. We highlight strong results in near-data computing solutions such as the Glider component and its real-world application in the genomics use case. Real-time, low-latency data processing has also been a major point of interest, with excellent results from the application of Pravega in the surgery use case. Finally, considerable effort has gone into developing secure and trusted environments that guarantee the confidentiality and privacy of sensitive data, thanks to the SCONE platform.

The NEARDATA project also includes components used by the scientific community around the world. METASPACE is widely used as a source of metabolomics data, supporting data discovery by the international metabolomics community. In addition, the NEARDATA project has defined new data spaces for health data (metabolomics, genomics and surgery) that aim to offer the scientific community solutions for dealing with extreme data.

NEARDATA Architecture

NEARDATA Genomics Data Space

NEARDATA Surgomics Data Space

NEARDATA Metabolomics Data Space
