Periodic Reporting for period 1 - EXTRACT (A distributed data-mining software platform for extreme data across the compute continuum)
Reporting period: 2023-01-01 to 2024-06-30
However, current solutions fail when coping with Extreme Data, defined as data that simultaneously exhibit several of the following characteristics: (1) volume, speed and variety; (2) complexity and diversity; (3) dispersion of data sources; and (4) sparse, missing or insufficient data, often with extreme variability. The reason behind this technological gap is that current computing technologies are optimized to cope with some of these data characteristics independently and uniformly, rather than addressing them in a holistic manner. This is the case of High-Performance Computing (HPC) technologies, which provide massive parallel processing capabilities to address volume and speed, and enable complex artificial intelligence models to deal with data complexity and incompleteness. Edge computing technologies can efficiently address the real-time and dispersed-data-source characteristics by moving the computation to where data is captured. Finally, cloud computing technologies play a fundamental role in dealing with extremely large volumes by providing highly scalable storage systems. Therefore, there is a need for novel and more holistic approaches that glue together all the aforementioned technologies into a unified solution, capable of managing and analysing the complete lifecycle of extreme data.
Towards this direction, EXTRACT has identified three key objectives: (1) optimize current data infrastructures and AI & Big-data frameworks to jointly address extreme data characteristics, data processes and analytics methods; (2) develop novel data-driven orchestration techniques to select the most appropriate set of computing resources to address extreme data characteristics; and (3) increase the interoperability between the most common programming practices and execution models used across the compute continuum, including edge, cloud and HPC. Addressing these three objectives in a holistic manner is fundamental to take full benefit of extreme data in industrial, business, environmental, scientific and societal contexts. To do so, EXTRACT aims to integrate these three objectives in a unified software stack capable of applying data-mining methods on extreme data. The capability of the EXTRACT technology will be demonstrated on two ambitious yet realistic use-cases from the Crisis Management and Space Science domains: the Personalized Evacuation Route (PER) and Transient Astrophysics with a Square Kilometre Array pathfinder (TASKA).
1. Optimize current data infrastructures and AI & Big-data frameworks. The EXTRACT data infrastructure includes the following components: (1) the ingestion component, responsible for inserting data into the EXTRACT platform; (2) the data and metadata layer, comprising the components responsible for data storage in the form of object storage and time series, and for data sharing across the compute continuum in a transparent way; (3) a semantic engine component, which creates and maintains logical relations between different data elements; and (4) the data staging component, responsible for guaranteeing scalable and elastic data processing by dynamically partitioning the data for the workflow. Moreover, a set of abstractions on top of the data infrastructure is being implemented to develop and execute workflows: an EXTRACT workflow will consume the stored data and will produce datasets as part of its operation, in addition to other types of results (e.g. visualizations, notifications to external systems).
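To make the interplay of these components concrete, the following minimal Python sketch (with purely hypothetical names, not the actual EXTRACT API) shows an ingestion step writing records into a toy object store and a staging step dynamically partitioning them into chunks that parallel workflow tasks can consume.

from dataclasses import dataclass, field
from typing import Dict, List
import json

@dataclass
class ObjectStore:
    """Toy stand-in for the object storage in the data and metadata layer (hypothetical)."""
    objects: Dict[str, bytes] = field(default_factory=dict)

    def put(self, key: str, value: bytes) -> None:
        self.objects[key] = value

    def get(self, key: str) -> bytes:
        return self.objects[key]

def ingest(store: ObjectStore, records: List[dict], prefix: str) -> List[str]:
    """Ingestion component: insert raw records into the platform."""
    keys = []
    for i, record in enumerate(records):
        key = f"{prefix}/record-{i:06d}.json"
        store.put(key, json.dumps(record).encode())
        keys.append(key)
    return keys

def stage(keys: List[str], n_partitions: int) -> List[List[str]]:
    """Data staging component: dynamically partition the data for the workflow."""
    return [keys[i::n_partitions] for i in range(n_partitions)]

def workflow_task(store: ObjectStore, partition: List[str]) -> dict:
    """One workflow task: consume stored data and produce a derived dataset."""
    values = [json.loads(store.get(k)) for k in partition]
    return {"records": len(values), "sum": sum(v["value"] for v in values)}

if __name__ == "__main__":
    store = ObjectStore()
    keys = ingest(store, [{"sensor": i, "value": i * 0.1} for i in range(10)], "demo")
    print([workflow_task(store, p) for p in stage(keys, n_partitions=3)])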
2. Development of novel data-driven orchestration techniques. EXTRACT is developing various components and strategies for the efficient deployment and execution of workflows, selecting the most appropriate computing resources across the compute continuum. Concretely, the orchestration architecture includes three separate layers: (i) the application layer, responsible for scheduling the data-driven workflow by creating and executing tasks according to their logical dependencies; (ii) the infrastructure layer, managed by Kubernetes and responsible for deploying the workflow as pods; and (iii) the monitoring layer, which collects real-time metrics on resource utilization, application performance and system state.
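A minimal sketch of the application and infrastructure layers follows, under the assumption that each workflow task maps to one Kubernetes Pod; the example workflow, image name and labels are illustrative and not taken from the project.

from graphlib import TopologicalSorter  # Python 3.9+

# Hypothetical workflow: task -> set of tasks it depends on.
WORKFLOW = {
    "ingest": set(),
    "preprocess": {"ingest"},
    "analyse": {"preprocess"},
    "publish": {"analyse"},
}

def pod_manifest(task: str, image: str = "example/worker:latest") -> dict:
    """Infrastructure layer (sketch): wrap one workflow task as a Kubernetes Pod manifest."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": f"wf-{task}", "labels": {"workflow-task": task}},
        "spec": {
            "restartPolicy": "Never",
            "containers": [{"name": task, "image": image, "args": ["--task", task]}],
        },
    }

if __name__ == "__main__":
    # Application layer (sketch): order the tasks according to their logical
    # dependencies, then hand each one to the infrastructure layer.
    for task in TopologicalSorter(WORKFLOW).static_order():
        print("deploying", pod_manifest(task)["metadata"]["name"])

In a real deployment the manifests would be submitted to the Kubernetes API server and the monitoring layer would observe the resulting pods; here they are only printed.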
3. Increase the interoperability across the compute continuum. EXTRACT is designing a platform capable of accommodating the highly diverse constraints of the compute continuum, which involves different premises: the edge is typically resource-constrained, while cloud and HPC are far less constrained but reside in distant locations, especially in terms of network distance. The EXTRACT compute continuum aims, on the one hand, to pool the resources at each premise to make them available to applications and, on the other hand, to provide a uniform environment for applications and services. This leads us to define the EXTRACT compute continuum as a set of clusters with an interconnecting backbone that consists of several key services: orchestration, global data access and a data catalog.
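The cluster-plus-backbone view can be illustrated with a small, hypothetical placement sketch in Python: the continuum is modelled as a list of clusters, and a task is placed on a cluster that already holds the required dataset (data locality via the global data access and catalog services) and has enough free capacity. Cluster names, sizes and the policy itself are assumptions made only for this illustration.

from dataclasses import dataclass
from typing import List

@dataclass
class Cluster:
    """One premise of the continuum (names and fields are illustrative)."""
    name: str
    kind: str            # "edge", "cloud" or "hpc"
    free_cpus: int
    datasets: List[str]  # datasets reachable through the global data access service

CONTINUUM = [
    Cluster("venice-edge", "edge", free_cpus=4, datasets=["camera-feeds"]),
    Cluster("hpc-site", "hpc", free_cpus=512, datasets=["radio-visibilities"]),
    Cluster("public-cloud", "cloud", free_cpus=128, datasets=[]),
]

def place(task_cpus: int, needed_dataset: str) -> Cluster:
    """Pick a cluster that already holds the data and has enough free CPUs,
    falling back to the largest cluster otherwise (a deliberately simple policy)."""
    local = [c for c in CONTINUUM if needed_dataset in c.datasets and c.free_cpus >= task_cpus]
    if local:
        return min(local, key=lambda c: c.free_cpus)  # smallest cluster that fits
    return max(CONTINUUM, key=lambda c: c.free_cpus)

if __name__ == "__main__":
    print(place(task_cpus=2, needed_dataset="camera-feeds").name)   # venice-edge
    print(place(task_cpus=64, needed_dataset="camera-feeds").name)  # hpc-site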
4. Use-case development. The PER use-case is being developed in the urban context of Venice, a city whose intricate network of narrow lanes and canals makes traditional evacuation strategies inadequate. The PER use-case combines Urban Digital Twins and Deep Reinforcement Learning techniques to generate personalized evacuation routes in case of emergency. The TASKA use-case is centred around the challenges of handling the extreme volumes of data generated by modern astronomical observatories such as LOFAR, NenuFAR and the SKA. These facilities are at the forefront of radio astronomy, capturing vast amounts of data from the Universe, particularly from dynamic celestial phenomena. Both use cases are being developed with the EXTRACT platform.
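As a purely illustrative stand-in for the PER route generation (the project uses Urban Digital Twins and Deep Reinforcement Learning; the graph, place names and shortest-path policy below are invented for this sketch), a hazard-weighted walkable network can be searched for a personalized route as follows:

import heapq

# Toy walkable network for illustration only: node -> [(neighbour, cost), ...].
# Edge costs could encode congestion or hazard levels fed by an urban digital twin.
GRAPH = {
    "rialto": [("san_marco", 4.0), ("campo_a", 1.5)],
    "campo_a": [("san_marco", 2.0), ("safe_zone", 5.0)],
    "san_marco": [("safe_zone", 1.0)],
    "safe_zone": [],
}

def evacuation_route(start: str, goal: str) -> list:
    """Dijkstra over the hazard-weighted graph; a stand-in for a learned policy."""
    queue = [(0.0, start, [start])]
    seen = set()
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return path
        if node in seen:
            continue
        seen.add(node)
        for nxt, weight in GRAPH[node]:
            if nxt not in seen:
                heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return []

if __name__ == "__main__":
    print(evacuation_route("rialto", "safe_zone"))  # ['rialto', 'campo_a', 'san_marco', 'safe_zone']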