Periodic Reporting for period 4 - DAPP (Data-centric Parallel Programming)
Reporting period: 2020-12-01 to 2021-09-30
Designing this hardware is challenging but doable, as industry has demonstrated. What remains a large open problem is how to program it. The DAPP project squarely addresses that challenge by developing a data-centric programming language and intermediate representation. DAPP also innovates by splitting the roles of domain programmer (or scientist) and performance engineer, defining a clear abstraction between the two.
The project has already demonstrated its applicability by compiling near-optimal programs for GPUs, CPUs, and FPGAs from the same data-centric code base. We will continue to refine the user-facing language as well as the optimization language for performance engineers; this work will also touch the internal graph representation. The project aims to deliver a workable implementation that is released to the community for further research and development.
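The core data-centric idea can be illustrated with a toy dataflow-graph IR: the program is expressed as operations connected by explicit data-movement edges, and a performance engineer applies transformations that reduce the volume moved, without changing the program's semantics. The sketch below is purely illustrative (the class, edge format, and `fuse` transformation are invented for this example and are not the project's actual IR):

```python
# Toy dataflow IR: nodes are operations, edges carry data volumes (bytes).
# Illustrative sketch only -- NOT the actual DAPP intermediate representation.

class DataflowGraph:
    def __init__(self):
        self.edges = []  # (src, dst, volume_bytes)

    def add_edge(self, src, dst, volume):
        self.edges.append((src, dst, volume))

    def total_movement(self):
        return sum(v for _, _, v in self.edges)

    def fuse(self, a, b):
        """Fuse two operations: the intermediate data between them stays
        on-chip, so its movement cost disappears from the graph."""
        fused = f"{a}+{b}"
        new_edges = []
        for src, dst, vol in self.edges:
            if (src, dst) == (a, b):
                continue  # intermediate stays local: no movement
            src = fused if src in (a, b) else src
            dst = fused if dst in (a, b) else dst
            new_edges.append((src, dst, vol))
        self.edges = new_edges

# y = (x * 2) + 1 over one million 8-byte elements
g = DataflowGraph()
N = 1_000_000 * 8
g.add_edge("x", "mul", N)
g.add_edge("mul", "add", N)   # intermediate array
g.add_edge("add", "y", N)
before = g.total_movement()
g.fuse("mul", "add")          # performance engineer's transformation
after = g.total_movement()
print(before, after)          # movement drops by one array's worth
```

The point of the abstraction is exactly this separation: the scientist writes the computation, while the performance engineer reasons about (and rewrites) the data movement in the graph.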
To provide a complete picture of the importance of data locality and data-centric programming, we summarized, together with other leading research groups, the collective knowledge on the topic. The review paper, titled “Trends in Data Locality Abstractions for HPC Systems”, was published in IEEE TPDS. For high-performance programming of FPGAs, we wrote a review and in-depth analysis of the techniques used to optimize HPC applications for data movement on dataflow architectures, consolidating the insights and state of the art in High-Level Synthesis (HLS) for FPGAs.
While general-purpose programming interfaces and IRs can accelerate a wide variety of regular applications, irregular applications, such as graph algorithms, pose additional challenges that we investigated. In particular, we constructed graph representations that reduce data movement, such as the Log(Graph) succinct representation; representations that are amenable to efficient processing (e.g. using vectorization) on CPUs and GPUs with SlimSell; and efficient representations for streaming graph algorithms on FPGAs, namely a substream-centric representation for the Maximum Weighted Matching algorithm.
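Why the graph layout matters for data movement can be seen even with plain CSR (compressed sparse row), the standard baseline that representations like Log(Graph) and SlimSell compress further. A minimal sketch (the graph and helper names are invented for illustration):

```python
# Dense adjacency matrix vs. CSR for a small sparse directed graph.
# Minimal illustration of why compact layouts reduce data movement;
# Log(Graph) and SlimSell apply far more aggressive compression.

n = 5
edges = [(0, 1), (0, 4), (2, 3), (3, 1), (4, 2)]

# Dense: n*n entries stored regardless of sparsity.
dense = [[0] * n for _ in range(n)]
for u, v in edges:
    dense[u][v] = 1

# CSR: a row-pointer array (n+1 entries) + a column-index array (|E| entries).
row_ptr = [0] * (n + 1)
for u, _ in edges:
    row_ptr[u + 1] += 1
for i in range(n):                 # prefix sum -> row offsets
    row_ptr[i + 1] += row_ptr[i]
col_idx = [0] * len(edges)
fill = list(row_ptr[:-1])          # next free slot per row
for u, v in sorted(edges):
    col_idx[fill[u]] = v
    fill[u] += 1

def neighbors(u):
    """Contiguous neighbor slice -- cache-friendly to traverse."""
    return col_idx[row_ptr[u]:row_ptr[u + 1]]

print(neighbors(0))                        # [1, 4]
print(n * n, len(row_ptr) + len(col_idx))  # 25 vs 11 stored entries
```

Traversing a vertex's neighbors touches one contiguous slice instead of scanning a full matrix row, which is the kind of data-movement saving the compressed representations above push much further.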
Optimizing communication performance is imperative for large-scale computing, as overheads limit the strong scalability of parallel applications. Today’s network cards also contain rather powerful processors optimized for data movement. As part of DAPP, we developed sPIN, a portable programming model to offload simple packet processing functions to the network card. The model provides both the simplicity of accelerator languages such as CUDA, and the flexibility of directly controlling the network card to optimize collective communication operations and system services by bypassing the CPU.
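The sPIN execution model of running tiny handler functions as packets arrive, so that only the reduced result reaches the host, can be mimicked in a short simulation. Everything below (function names, packet format, the dictionary-based handler state) is invented for illustration; real sPIN handlers are compiled for and executed on the network card:

```python
# Simulated sPIN-style offload: a small per-packet handler runs "on the NIC"
# as each packet of a message arrives, so only the final result crosses to
# the host CPU. Names and structure are illustrative, not the real sPIN API.
import struct

def make_packets(values, per_packet=4):
    """Split a list of doubles into fixed-size packet payloads."""
    return [struct.pack(f"{len(chunk)}d", *chunk)
            for chunk in (values[i:i + per_packet]
                          for i in range(0, len(values), per_packet))]

def packet_handler(payload, state):
    """Runs once per arriving packet: accumulate a sum without ever
    copying the raw packets into host memory."""
    n = len(payload) // 8
    state["sum"] += sum(struct.unpack(f"{n}d", payload))

def nic_receive(packets, handler):
    state = {"sum": 0.0}
    for p in packets:          # the NIC invokes the handler per packet
        handler(p, state)
    return state["sum"]        # only the reduced value reaches the CPU

result = nic_receive(make_packets([1.0, 2.0, 3.0, 4.0, 5.0]), packet_handler)
print(result)  # 15.0
```

This mirrors the CUDA-like simplicity mentioned above: the programmer writes only the small per-packet kernel, while the model handles where and when it runs.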
Lastly, we leveraged machine-learning and deep-learning techniques to statically analyze and comprehend code semantics. In particular, we proposed Neural Code Comprehension, a novel processing pipeline that learns code semantics robustly, based on a novel embedding space that we call inst2vec. The pipeline analyzes the dataflow of an application (using an IR) and applies the result to a variety of program-analysis tasks, including algorithm classification, hardware mapping (i.e. whether a program will run faster on a CPU or a GPU), and predicting thread-coarsening factors, setting a new state of the art in accuracy for two of the three tasks.
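At a very high level, such a pipeline can be caricatured as: embed each IR statement as a vector, aggregate over the program, then classify. The sketch below uses made-up two-dimensional embeddings and nearest-centroid classification purely for illustration; the real inst2vec embeddings are learned from dataflow context and feed neural classifiers, not a centroid lookup:

```python
# Caricature of an inst2vec-style pipeline: map IR statements to vectors,
# average them into a program embedding, classify by nearest centroid.
# Embeddings, labels, and centroids below are invented for illustration.
import math

# Pretend embedding table for (normalized) IR statements.
EMBED = {
    "load":  (1.0, 0.0),
    "store": (0.9, 0.1),
    "fmul":  (0.0, 1.0),
    "fadd":  (0.1, 0.9),
}

def program_embedding(statements):
    """Average the per-statement vectors into one program vector."""
    vecs = [EMBED[s] for s in statements]
    n = len(vecs)
    return tuple(sum(v[i] for v in vecs) / n for i in range(2))

# Task centroids, e.g. a toy "memory-bound" vs "compute-bound" mapping task.
CENTROIDS = {"memory-bound": (1.0, 0.0), "compute-bound": (0.0, 1.0)}

def classify(statements):
    emb = program_embedding(statements)
    return min(CENTROIDS, key=lambda c: math.dist(emb, CENTROIDS[c]))

print(classify(["load", "store", "load"]))         # memory-bound
print(classify(["fmul", "fadd", "fmul", "load"]))  # compute-bound
```

The key property the caricature preserves is that semantically similar statements land near each other in the embedding space, so whole-program decisions (like the CPU-vs-GPU mapping task above) can be made from aggregated vectors.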