The main achievement of this project is the demonstration that data-centric optimisations are paramount to achieving high performance across a wide range of domains, from dense linear algebra to graph mining, weather prediction and machine learning. In addition, the project developed a tool, DaCe, to perform such optimisations. Leveraging our tools, we won a Gordon Bell award for the fastest ever quantum transport simulation and accelerated the production weather code FV3 by almost a factor of 4 (3.92 at scale, using 2400 compute nodes). Such results are only possible by combining multiple findings and improvements: our advances in transfer tuning, novel graph algorithms for the maximum flow problems that appear when optimising data-centric programs, and our collaborative efforts with industry and academic partners. These collaborations ensure that our tools are widely adopted, give us insight into upcoming hardware architectures, and make sure our tools are ready to optimise code for those architectures when they are deployed.
One category of achievements of this project is “performance results”, i.e. instances where data-centric optimisations have been shown to lead to large performance improvements in applications. Several publications describe examples in detail. In “Deinsum: Practically I/O Optimal Multilinear Algebra” we accelerate dense linear algebra by a factor of 2 to 18 compared to a state-of-the-art library for the same purpose, executed on 512 compute nodes. In “Productive Performance Engineering for Weather and Climate Modeling with Python” we show that data-centric optimisations are also applicable to highly optimised simulation codes, such as the FV3 weather code (see above for details on the achieved performance). In “FMI: Fast and Cheap Message Passing for Serverless Functions” we show that optimising data movement delivers performance improvements not only for “classical” HPC problems but also for data-centre workloads such as serverless computing, where we demonstrate improvements of two orders of magnitude.
These performance results build on more basic research that we carried out in line with the DoA. We highlight some specific results here that showcase the quality and novelty of our work. In “FuzzyFlow: Leveraging Dataflow To Find and Squash Program Optimization Bugs” we developed novel algorithms for efficient differential testing of data-flow programs. This allowed us to find bugs in the optimisations for weather and climate codes in a matter of seconds, whereas a run of the full workload can easily take days on hundreds of nodes. Our work on performance embeddings is similarly foundational: we use the optimisation strategies outlined in “Performance Embeddings: A Similarity-Based Transfer Tuning Approach to Performance Optimization” to tune a wide variety of programs in a semi-automated manner. Our data-flow centric program representation is based on parametric graphs. To transform such graphs into representations of more efficient programs, multiple well-known graph problems (finding paths, capacities along paths, reachability analysis, etc.) have to be solved. While these problems are well studied on non-parametric graphs, many of them have not been explored for parametric graphs, and we have made significant progress in that area as well. In particular, our paper “Maximum Flows in Parametric Graph Templates” provides novel and efficient algorithms for the maximum flow problem in parametric graphs.
We did not only apply data-centric optimisations to existing problems and architectures. We also worked on designing architectures for data-flow optimised algorithms, namely topologies that are easy to manufacture and offer high bandwidth and low latency. Examples of that work are our publications “Sparse Hamming Graph: A Customizable Network-on-Chip Topology” and “HexaMesh: Scaling to Hundreds of Chiplets with an Optimized Chiplet Arrangement”.
Overall, we can proudly say that the full-stack approach to data-centric optimisations outlined in the DoA was a big success. We have shown the applicability of our work across a wide range of fields and demonstrated lighthouse performance results. These results rest on a wide array of algorithmic and architectural improvements, which we had the manpower to carry out thanks to the structure of this project.