Productive Spatial Accelerator Programming

Project Information

PSAP

Grant agreement ID: 101002047

Project website

DOI

10.3030/101002047

EC signature date 5 March 2021

Start date 1 January 2022

End date 31 December 2026

Funded under

EXCELLENT SCIENCE - European Research Council (ERC)

Total cost

€ 1 980 917,00

EU contribution

€ 1 980 917,00


Coordinated by

EIDGENOESSISCHE TECHNISCHE HOCHSCHULE ZUERICH
Switzerland

Periodic Reporting for period 2 - PSAP (Productive Spatial Accelerator Programming)

Reporting period: 2023-07-01 to 2024-12-31

The PSAP project aims to advance the science and practice of current and future computer programming beyond the frontier of Moore’s law by focusing on a new spatial view.
Our goal is to achieve a breakthrough in computing by enabling programmers to utilize spatial heterogeneous systems with a highly productive programming, tuning, deployment, and execution environment, such as Python on a cluster in the cloud. We target possibly the hardest challenge in IT today: enabling society to efficiently exploit the enormous possibilities opened up by heterogeneous hardware architectures such as today’s hardware accelerators (e.g. GPUs) or tomorrow’s reconfigurable and domain-specific processors. These also form the major driver in artificial intelligence today. The major obstacle is the missing ability to program such architectures portably and efficiently. The PSAP project proposes to develop the necessary theory, implementation, and libraries for solving this problem.

The main achievement of this project is the demonstration that data-centric optimisations are paramount to achieving high performance across a wide range of domains, from dense linear algebra to graph mining, weather prediction and machine learning. In addition, the project developed the tool (DaCe) that performs such optimisations. Using our tools, we won a Gordon Bell award for the fastest ever quantum transport simulation and accelerated the production weather code FV3 by almost a factor of 4 (3.92 at scale, using 2400 compute nodes). Such results are only possible by combining multiple findings and improvements: our advances in transfer tuning, novel graph algorithms for the maximum flow problems that appear when optimising data-centric programs, and our collaborations with industry and academic partners. These collaborations ensure that our tools are widely adopted, give us insight into upcoming hardware architectures, and make sure our tools are ready to optimise code for those architectures when they are deployed.

A first category of achievements is “performance results”, i.e. instances where data-centric optimisations have been shown to lead to large performance improvements in applications. Examples of this category are described in detail in our publications. In “Deinsum: Practically I/O Optimal Multilinear Algebra” we accelerate dense linear algebra by a factor of 2 to 18 (compared to a state-of-the-art library for the same purpose, executed on 512 compute nodes). In “Productive Performance Engineering for Weather and Climate Modeling with Python” we show that data-centric optimisations are also applicable to highly optimised simulation codes, such as the FV3 weather code (see above for details on the achieved performance). In “FMI: Fast and Cheap Message Passing for Serverless Functions” we show that optimising data movement delivers performance improvements not only for “classical” HPC problems but also for data-centre workloads such as serverless computing, where we demonstrate improvements of two orders of magnitude.

These performance results build on more basic research that we carried out in line with the Description of Action (DoA). We highlight some specific results here that showcase the quality and novelty of our work. In “FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs” we developed novel algorithms for efficient differential testing of data-flow programs. This allowed us to find bugs in the optimisations for weather and climate codes in a matter of seconds, whereas a run of the full workload can easily take days on hundreds of nodes. Our work on performance embeddings is similarly foundational: we use the optimisation strategies outlined in “Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization” to tune a wide variety of programs in a semi-automated manner. Finally, our data-flow centric program representation is based on parametric graphs. To transform such graphs into representations of more efficient programs, multiple well-known graph problems (finding paths, capacities along paths, reachability analysis, etc.) have to be solved. These problems are well studied on non-parametric graphs, but many of them have not been explored for parametric graphs, and we have made significant progress in that area as well: our paper “Maximum Flows in Parametric Graph Templates” provides novel and efficient algorithms for the maximum flow problem in parametric graphs.
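To illustrate the general idea of differential testing mentioned above, the following is a minimal Python sketch. It is not FuzzyFlow's algorithm and does not use DaCe; the functions `reference`, `optimised_ok`, `optimised_buggy` and `differential_test` are hypothetical names invented for this illustration. The sketch runs an original and a transformed version of a small computation on the same random inputs and reports the first input on which the two disagree.

```python
import random


def reference(xs):
    """Unoptimised program: sum of squares of the even entries."""
    total = 0
    for x in xs:
        if x % 2 == 0:
            total += x * x
    return total


def optimised_ok(xs):
    """A correct 'optimised' variant of reference (hypothetical)."""
    return sum(x * x for x in xs if x % 2 == 0)


def optimised_buggy(xs):
    """A faulty variant: an off-by-one error drops the last element."""
    return sum(x * x for x in xs[:-1] if x % 2 == 0)


def differential_test(f, g, trials=200, seed=0):
    """Run f and g on the same random inputs.

    Returns the first input on which the outputs differ,
    or None if no difference was found within `trials` inputs.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randint(0, 8)
        xs = [rng.randint(-10, 10) for _ in range(length)]
        if f(xs) != g(xs):
            return xs
    return None


if __name__ == "__main__":
    print("correct variant:", differential_test(reference, optimised_ok))
    print("buggy variant:  ", differential_test(reference, optimised_buggy))
```

A returned input is a concrete counterexample that can be replayed in isolation. Finding no difference does not prove that a transformation is correct, only that none of the sampled inputs exposed a bug.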

We did not limit ourselves to applying data-centric optimisations to existing problems and architectures. We also worked on designing architectures for data-flow optimised algorithms, i.e. topologies which are easy to manufacture and offer high bandwidth and low latency. Examples of that work are our publications “Sparse Hamming Graph: A Customizable Network-on-Chip Topology” and “HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement”.

Overall, we can proudly say that the full-stack approach to data-centric optimisations outlined in the DoA was a big success. We have shown the applicability of our work across a wide range of fields and demonstrated lighthouse performance results. These results rest on a wide array of algorithmic and architectural improvements, which we had the staff capacity to carry out thanks to the structure of this project.

DaCe already outperforms traditional compilers for classical CPU architectures and GPUs for many workloads. By the end of the project we expect to also integrate more modern spatial architectures and better support for machine learning. Data-centric programming is also a promising stepping stone in bridging the gap between AI and science (e.g. allowing users to use automatic differentiation in classical HPC codes without the use of adjoint models). We are currently investigating these areas and plan to make significant progress in them by the end of the project.
