Data has become the fuel of the digital economy and one of the most valuable assets in the world. The true potential of this emerging market lies in the ability to extract valuable and trustworthy knowledge from huge volumes of heterogeneous and dispersed raw data, and to use it efficiently to create a positive impact across many domains of our lives and enable advanced decision-making strategies. Harvesting this potential requires a synergy of cutting-edge technologies that handles the complete data lifecycle and value chain in a secure and trustworthy manner: from the collection of huge volumes of data across highly distributed and heterogeneous sources, through the mining of meaningful, accurate, reliable and useful knowledge, to the final consumption of the data by applications and end users.
However, current solutions fail when coping with Extreme Data, defined as data that simultaneously exhibit several of the following characteristics: (1) volume, speed and variety; (2) complexity and diversity; (3) dispersion of data sources; and (4) sparse, missing or insufficient data, often with extreme variability. The reason behind this technological gap is that current computing technologies are optimized to cope with some of these data characteristics independently and uniformly, rather than addressing them in a holistic manner. This is the case for High-Performance Computing (HPC) technologies, which provide massively parallel processing capabilities to address volume and speed, and enable complex artificial intelligence models to deal with data complexity and incompleteness. Edge computing technologies can efficiently address the real-time requirements and the dispersion of data sources by moving the computation to where data is captured. Finally, cloud computing technologies play a fundamental role in dealing with extremely large volumes by providing highly scalable storage systems. Therefore, there is a need for novel and more holistic approaches that glue together all the aforementioned technologies into a unified solution capable of managing and analysing the complete lifecycle of extreme data.
To this end, EXTRACT has identified three key objectives: (1) optimize current data infrastructures and AI and Big Data frameworks to jointly address extreme data characteristics, data processes and analytics methods; (2) develop novel data-driven orchestration techniques that select the most appropriate set of computing resources to address extreme data characteristics; and (3) increase the interoperability between the most common programming practices and execution models used across the compute continuum, including edge, cloud and HPC. Addressing these three objectives in a holistic manner is fundamental to taking full advantage of extreme industrial, business, environmental, scientific and societal data. To do so, EXTRACT aims to integrate the three objectives in a unified software stack capable of applying data mining methods to extreme data. The capability of the EXTRACT technology will be demonstrated on two ambitious yet realistic use cases from the Crisis Management and Space Science domains: a Personalized Evacuation Route and Transient Astrophysics with a Square Kilometre Array pathfinder (TASKA).