Periodic Reporting for period 2 - MaX (MAterials design at the eXascale)
Reporting period: 2024-01-01 to 2025-06-30
Objective I. Lighthouse applications
Selected community codes for quantum materials modelling, widely adopted by a broad body of end users and already working on pre-exascale systems, will be turned into lighthouse applications, ready to run on new HPC exascale platforms, while further developing their scientific capabilities.
Objective II. Exascale workflows and data
The evaluation of complex materials properties will be encoded in exascale-oriented workflows, allowing lighthouse applications to be driven automatically and orchestrated to exploit exascale capabilities, deliver resilience and fault tolerance, and ensure the dissemination of the entire simulation protocol, results, and data.
Objective III. Addressing technical challenges
The technical challenges faced by the flagship codes at exascale are identified and addressed, to provide technology insight and solutions to the code developers. This will start from performance analysis, to identify the bottlenecks and the directions for optimisation at scale. We will exploit innovative features of programming models to help maximise performance on novel architectures (from data vectorisation to the removal of communication synchronisation).
Objective IV. Co-design and technology exploitation
HW and SW innovations are monitored and included in the MaX co-design cycle, while optimising the exploitation of heterogeneous architectures, and delivering a set of validated co-design vehicles and best practices to be shared with other HPC stakeholders. MaX co-design acts at different levels (chip, node, system), targeting HW relevant for the development of EuroHPC exascale machines. In addition to optimal performance, we will explore techniques to reduce energy to solution. Energy efficiency will be tackled from the hardware side, by deploying the codes on cutting-edge processors and accelerators, and from the software side, by using European energy-aware runtime systems.
Obj I. Key results M13-30:
- The porting of MaX codes to accelerated architectures based on NVIDIA, AMD, and Intel GPUs has been extended and further optimized. Benchmark runs on all pre-exascale machines have been collected.
- The intra- and inter-node performance of all codes has been continuously benchmarked and significantly optimized. Large-scale calculations exploiting hundreds of GPU-accelerated nodes have been demonstrated multiple times.
- A large number of external optimized libraries from the ecosystem (from linear algebra to advanced algorithms and solvers) have been tested on MaX codes and can be exploited in production runs.
- All codes have produced public stable releases (often more than one per year), with developments at an earlier stage made available to users via dedicated branches in public git repositories.
- While undergoing this technical development, the codes have been equipped with a number of new features (scientific capabilities, performance and robustness algorithms), aimed at addressing selected scientific grand challenges via exascale workflows, as developed by WP2.
Obj II. Key results M13-30:
- The set of identified target scientific grand challenges to be addressed by MaX exascale workflows has been consolidated and further extended with new use cases.
- Hero runs exploiting AiiDA on the Leonardo supercomputer (up to 3000 nodes) have been executed to train ML interatomic potentials.
- Previously identified code interoperability requirements have been addressed and implemented.
- Technical tools aimed at workflow implementation such as the AiiDA orchestrator and the HyperQueue meta-scheduler have been further developed and adapted.
- FAIR data and storage: the implemented mirroring mechanism involving CSCS and CINECA has been extended to the Juelich Supercomputing Centre.
Obj III. Key results M13-30:
- Work focused on deploying the MaX flagship codes on many other EuroHPC systems, monitoring and reporting their efficiency and performance.
- Notably, benchmark data on all the accelerated partitions of the EuroHPC pre-exascale machines, as well as preliminary data on Jupiter, have been collected and analyzed.
- Technical solutions based on advanced features and/or programming models have been tested in realistic runs of the flagship codes.
- Deployment of MaX flagship codes on EuroHPC machines has been achieved, with a coverage of supported architectures of 98%.
Obj IV. Key results M13-30:
- MaX mini-apps have been fully extracted, made available in a dedicated repository, and exploited for co-design activities.
- Co-design: 11 technical co-design reports covering analysis and optimisation, and 3 general reports providing information on EUPEX and SiPearl technology to the MaX consortium, have been released.
- Advanced Hardware: MaX flagship codes have been tested on advanced hardware, relevant for the European HPC roadmap.
- Energy efficiency has been evaluated on multiple production systems, now including GPU-accelerated partitions, for all MaX codes (via MERIC). The results show a reduction in energy to solution of 10-19% on CPU and 6-9% on GPU.
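As an illustration of how an energy-to-solution saving of this kind can be computed, the sketch below assumes that energy to solution is the average power draw multiplied by the wall-clock runtime, and that the saving is reported relative to an untuned baseline run. The function names and all numbers are hypothetical and chosen only for the arithmetic; they are not MaX measurements and do not reflect any specific tool's interface.

```python
def energy_to_solution(avg_power_watts: float, runtime_seconds: float) -> float:
    """Energy to solution in joules, assumed here to be average power times runtime."""
    return avg_power_watts * runtime_seconds


def relative_saving(baseline_joules: float, tuned_joules: float) -> float:
    """Fractional reduction in energy to solution relative to the baseline run."""
    return (baseline_joules - tuned_joules) / baseline_joules


# Hypothetical numbers for illustration only (not MaX measurements):
# the tuned run draws less power but takes slightly longer to finish.
baseline = energy_to_solution(avg_power_watts=500.0, runtime_seconds=1000.0)
tuned = energy_to_solution(avg_power_watts=400.0, runtime_seconds=1100.0)
saving = relative_saving(baseline, tuned)

print(f"baseline: {baseline:.0f} J, tuned: {tuned:.0f} J, saving: {saving:.1%}")
```

In this made-up case the tuned run is 10% slower yet still uses less energy overall, which is why energy to solution is assessed separately from time to solution.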
Concerning FAIR data and storage, we implemented a mirroring mechanism that makes the data stored at CSCS also available from CINECA and from the Juelich Supercomputing Centre.
Deployment of MaX flagship codes on EuroHPC machines was achieved on all the distinct technologies adopted.