REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems

Periodic Reporting for period 2 - RECIPE (REliable power and time-ConstraInts-aware Predictive management of heterogeneous Exascale systems)

Reporting period: 2019-11-01 to 2021-10-31

By 2023, High Performance Computing (HPC) systems should be able to perform 10^18 operations per second, a level known as Exascale.
RECIPE addresses a crucial problem: developing a software layer that sits between the hardware and the applications and keeps the system reliable despite the increasing number of resources and the resulting decrease in the time between failures.
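
As a rough illustration of why failures become more frequent as the number of resources grows, the sketch below assumes identical, independent components with exponentially distributed failures, so that failure rates simply add up. This is a textbook simplification, not a model taken from RECIPE; the function name `system_mttf_hours` and all numbers are hypothetical.

```python
def system_mttf_hours(component_mttf_hours: float, n_components: int) -> float:
    """Mean time to failure of a system that fails when any one component fails.

    Assumption: n identical, independent components with exponentially
    distributed failures, so the system failure rate is n times the
    component failure rate.
    """
    if component_mttf_hours <= 0 or n_components < 1:
        raise ValueError("need a positive MTTF and at least one component")
    return component_mttf_hours / n_components


# Hypothetical figure: each node fails on average once every 5 years.
node_mttf = 5 * 365 * 24  # hours

for nodes in (1_000, 10_000, 100_000):
    mttf = system_mttf_hours(node_mttf, nodes)
    print(f"{nodes:>7} nodes -> system MTTF of about {mttf:.2f} hours")
```

Under these assumptions, multiplying the node count by ten divides the expected time between system-level failures by ten.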

RECIPE provides:
- A hierarchical runtime resource management infrastructure able to optimise energy efficiency and to minimise the occurrence of thermal hotspots. This infrastructure will also enforce the time constraints imposed by the application, ensuring reliability for both time-critical and throughput-oriented computations;
- A predictive reliability methodology to support QoS in the face of both transient and long-term hardware failures;
- A set of integration layers to allow the resource manager to interact with both the application and the underlying deeply heterogeneous architecture;
- A simulation-based platform for validating the resource management policies at large scale.

RECIPE’s goals
1. To increase the energy efficiency of HPC systems by 25%, with an improvement of 15% of MTTF;
2. To improve the energy-delay product by up to 25%;
3. To reduce the occurrence of faulty executions by 20%, with recovery times compatible with real-time performance and full exploitation of available resources under non-saturated conditions.
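
The energy-delay product named in the second goal is conventionally the energy consumed by a run multiplied by its execution time. The sketch below shows how a relative improvement in that metric could be computed. The function names and the measurements are hypothetical and are not RECIPE results.

```python
def energy_delay_product(energy_joules: float, runtime_seconds: float) -> float:
    """Energy-delay product (EDP): energy consumed multiplied by execution time."""
    return energy_joules * runtime_seconds


def relative_improvement(baseline: float, optimised: float) -> float:
    """Fractional reduction of a lower-is-better metric with respect to a baseline."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - optimised) / baseline


# Hypothetical measurements of the same job, without and with resource management.
baseline_edp = energy_delay_product(energy_joules=1000.0, runtime_seconds=100.0)
managed_edp = energy_delay_product(energy_joules=800.0, runtime_seconds=95.0)

gain = relative_improvement(baseline_edp, managed_edp)
print(f"EDP improvement: {gain:.0%}")
```

Because the metric multiplies energy by time, a policy that saves energy only by slowing the job down gains little, which is why EDP is often reported alongside plain energy efficiency.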

RECIPE assessed its results against real world use cases, addressing key application domains:
1. Geophysical exploration: thanks to the efficient implementation of the run-time resource manager (RTRM), the resulting Full Waveform Inversion tool reduces the uncertainty of current seismic exploration surveys;
2. Environmental monitoring and meteorology: the developed RTRM will improve the ability to keep the status of water basins under control and to monitor the behaviour of power plants exploiting renewable energy sources (RES), such as wind turbines;
3. Bio-medical machine learning and big data analytics: the developed software infrastructure will enable the deployment of the epileptic seizure detection algorithms in a prototype platform able to manage a large-scale population while meeting the real-time requirements of the application.

To enact this ambitious research and innovation programme, the RECIPE project relies on a consortium of leading academic partners, supercomputing centres, a research hospital and an SME. The academic partners are POLIMI, the largest technical university in Italy, providing expertise on resource management and programming models as well as scientific coordination; EPFL, the leading provider of thermal models for HPC; UPV, one of the key innovators in optimised interconnection networks; and CeRICT, providing expertise on accelerators. The two supercomputing centres are BSC, one of the leading HPC providers in Europe, whose MareNostrum system ranked 13th in the Top 500 in June 2017, and PSNC, another Top 500 HPC centre, located in Poland. The consortium is completed by CHUV, a research hospital in Switzerland, and IBTS, an SME active in product design and development, which provides effective exploitation avenues through industry-based use cases.
At the end of the project, the RECIPE Consortium achieved its stated goals, developing and deploying the following technologies:
- A heterogeneous, reconfigurable, multi-accelerator hardware platform for high performance computing, implementing remote resource access through a dedicated hardware abstraction layer;
- A software stack supporting the programmability of the platform, as well as the hierarchical management of resource allocation;
- A set of resource management policies targeting performance, energy, and reliability, powered by timing, reliability and thermal models;
- A set of extensions to the DCworms large-scale simulator to model the RECIPE platforms, applications, and resource management policies;
- A set of three integrated applications, demonstrating the RECIPE technologies.
The RECIPE hardware and software stack enabled the consortium to demonstrate how predictive reliability can be used effectively to manage the rising frequency of errors caused by the growing scale of HPC systems.
By exploiting probabilistic execution time analysis, component degradation models that take into account ageing induced by high operating temperatures, and appropriate control policies, it is possible to reduce the failures in time by 20-25% while keeping the utilisation of the system above 90%. Furthermore, automated resource management coupled with remote resource access can improve the energy efficiency of applications, significantly reducing the operating costs of HPC centres as well as their carbon footprint.
Overall, the scientific findings of RECIPE were reported in around 50 peer-reviewed publications.
The research directions pioneered by RECIPE will be carried on in seven projects funded by the European Union under the EuroHPC, H2020 and Horizon Europe programmes.