Periodic Reporting for period 2 - PSAP (Productive Spatial Accelerator Programming)
Reporting period: 2023-07-01 to 2024-12-31
Our goal is to achieve a breakthrough in computing by enabling programmers to utilize spatial heterogeneous systems with a highly productive programming, tuning, deployment, and execution environment, such as Python on a cluster in the cloud. We target what is possibly the hardest challenge in IT today: enabling society to efficiently exploit the enormous possibilities opened up by heterogeneous hardware architectures, such as today’s hardware accelerators (e.g., GPUs) or tomorrow’s reconfigurable and domain-specific processors. These architectures are also the major driver of artificial intelligence today. The major obstacle is the missing ability to program such architectures portably and efficiently. The PSAP project proposes to develop the necessary theory, implementations, and libraries to solve this problem.
We can categorise the achievements of this project by differentiating between “performance results”, i.e. instances where data-centric optimisations have been shown to lead to large performance improvements in applications, and the more foundational research results that enable them (described further below). An example of the first category is described in detail in the publication “Deinsum: Practically I/O Optimal Multilinear Algebra”, where we accelerate dense linear algebra by factors of two to 18 compared to a state-of-the-art library for the same purpose, executed on 512 compute nodes. In our work “Productive Performance Engineering for Weather and Climate Modeling with Python” we show that data-centric optimisations are also applicable to highly optimised simulation codes, such as the FV3 weather code (see above for details on the achieved performance). In our work “FMI: Fast and Cheap Message Passing for Serverless Functions” we show that optimising data movement delivers performance improvements not only for “classical” HPC problems but also for data-centre workloads such as serverless computing. In such settings we are able to demonstrate improvements of two orders of magnitude.
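To illustrate the general principle behind these data-centric optimisations, the following minimal Python sketch contrasts a computation that materialises an intermediate array with a fused variant that avoids the extra round trip to memory. It is purely illustrative and not taken from any of the project’s codebases; the function names and array size are chosen arbitrarily.

import numpy as np

# Illustrative only: a tiny example of the data-movement trade-off that
# data-centric optimisations exploit. Names and sizes are arbitrary.
N = 1 << 16
a = np.random.rand(N)

def scale_shift_two_passes(x):
    # The temporary array `t` is written out to memory and read back,
    # so the data is moved twice as often as necessary.
    t = 2.0 * x
    return t + 1.0

def scale_shift_fused(x):
    # One fused pass: the intermediate value stays in registers/cache.
    # (Written as an explicit loop for clarity; an optimising toolchain
    # would emit this as compiled code, not a Python loop.)
    out = np.empty_like(x)
    for i in range(x.shape[0]):
        out[i] = 2.0 * x[i] + 1.0
    return out

assert np.allclose(scale_shift_two_passes(a), scale_shift_fused(a))

In the project, transformations of this kind are derived from a data-flow representation of the whole program rather than written by hand.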
These impressive performance results are based on more fundamental research carried out by us in line with the DoA. We want to highlight some specific results here which showcase the quality and novelty of our work. We have developed novel algorithms to perform efficient differential testing of data-flow programs in our work “FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs”. This allowed us to find bugs in the optimisations for weather and climate codes in a matter of seconds, whereas a run of the full workload can easily take days on hundreds of nodes. Our work on performance embeddings is similarly foundational: we use the optimisation strategies outlined in “Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization” to tune a wide variety of programs in a semi-automated manner. Our data-flow-centric program representation is based on parametric graphs. In order to transform such graphs into representations of more efficient programs, multiple well-known graph problems (finding paths, capacities along paths, reachability analysis, etc.) have to be solved. While these problems are well studied on non-parametric graphs, many of them have not been explored for parametric graphs, and we have made significant progress in that area as well; for example, our paper “Maximum Flows in Parametric Graph Templates” provides novel and efficient algorithms for the maximum-flow problem in parametric graphs.
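For context, the following sketch solves the classical (non-parametric) maximum-flow problem with the off-the-shelf networkx solver; the graph and its capacities are invented purely for illustration. The contribution of “Maximum Flows in Parametric Graph Templates” lies in extending such analyses to parametric graph templates, which this baseline does not attempt to reproduce.

import networkx as nx

# A small, made-up capacitated directed graph.
G = nx.DiGraph()
G.add_edge("s", "a", capacity=3)
G.add_edge("s", "b", capacity=2)
G.add_edge("a", "t", capacity=2)
G.add_edge("a", "b", capacity=1)
G.add_edge("b", "t", capacity=3)

# Classical max-flow on one fixed graph; parametric graph templates
# instead describe whole families of such graphs symbolically.
flow_value, flow_per_edge = nx.maximum_flow(G, "s", "t")
print(flow_value)  # 5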
We not only applied data-centric optimisations to existing problems and architectures; we also worked on designing architectures for data-flow-optimised algorithms, i.e. topologies which are easy to manufacture and offer high bandwidth and low latency. Examples of this work are our publications “Sparse Hamming Graph: A Customizable Network-on-Chip Topology” and “HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement”.
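As a rough illustration of the trade-offs explored in this line of work, the sketch below compares the diameter and node degree of a plain 4x4 mesh of chiplets with a textbook Hamming graph over the same number of nodes, built with networkx. The sparsified and hexagonal constructions actually proposed in the two papers are not reproduced here.

import networkx as nx

# 16 nodes as a plain 4x4 mesh (cheap to wire, but large diameter) ...
mesh = nx.grid_2d_graph(4, 4)

# ... versus the dense Hamming graph H(2, 4): the Cartesian product of
# two 4-node cliques (small diameter, but high per-node link count).
hamming = nx.cartesian_product(nx.complete_graph(4), nx.complete_graph(4))

for name, g in [("4x4 mesh", mesh), ("Hamming H(2,4)", hamming)]:
    print(f"{name}: diameter={nx.diameter(g)}, "
          f"max degree={max(d for _, d in g.degree())}")

The two publications navigate exactly this space: topologies that remain easy to manufacture while offering much better latency and bandwidth characteristics than a plain mesh.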
Overall, we can proudly say that the full-stack approach to data-centric optimisations outlined in the DoA was a big success. We have been able to show the applicability of our work across a wide range of fields and to demonstrate lighthouse performance results, which were based on a wide array of algorithmic and architectural improvements that the structure of this project gave us the personnel to carry out.