Periodic Reporting for period 2 - BeyondMoore (Pioneering a New Path in Parallel Programming Beyond Moore’s Law)
Reporting period: 2023-02-01 to 2024-07-31
Current approaches to programming heterogeneous systems are often host-centric: the main processor orchestrates the entire execution, which can cause scalability problems and limit the types of parallelism that can be exploited effectively. BeyondMoore seeks to overcome these limitations by proposing an accelerator-centric execution model, in which accelerators have a greater degree of autonomy and can collaborate and communicate with each other without constant intervention from the host processor. We refer to this programming model as CPU-free execution, where the CPU is removed from the critical path of computation.
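To make the contrast concrete, the following is a minimal sketch in plain Python, not BeyondMoore code. It simulates a 1D Jacobi sweep split across several "devices" and counts how often the host has to act. The names `step`, `host_centric`, `cpu_free` and `Device` are hypothetical, and the assumption that a host-centric run needs one launch per device plus one host-mediated halo exchange per iteration is an illustrative simplification.

```python
def step(chunk, left, right):
    """One Jacobi sweep over a chunk, given halo values from its neighbours."""
    padded = [left] + chunk + [right]
    return [(padded[i - 1] + padded[i + 1]) / 2 for i in range(1, len(padded) - 1)]


def host_centric(chunks, iterations):
    """The host exchanges halos and relaunches every device each iteration."""
    chunks = [list(c) for c in chunks]
    host_ops = 0
    for _ in range(iterations):
        halos = [
            (chunks[d - 1][-1] if d > 0 else 0.0,
             chunks[d + 1][0] if d < len(chunks) - 1 else 0.0)
            for d in range(len(chunks))
        ]
        host_ops += 1                # host-mediated halo exchange
        chunks = [step(c, l, r) for c, (l, r) in zip(chunks, halos)]
        host_ops += len(chunks)      # one kernel launch per device
    return chunks, host_ops


class Device:
    """Toy stand-in for an accelerator that can read its peers directly."""

    def __init__(self, chunk):
        self.chunk = list(chunk)
        self.left = None
        self.right = None

    def read_halos(self):
        left = self.left.chunk[-1] if self.left else 0.0
        right = self.right.chunk[0] if self.right else 0.0
        return left, right


def cpu_free(chunks, iterations):
    """The host launches each device once; the devices then iterate on their own."""
    devices = [Device(c) for c in chunks]
    for a, b in zip(devices, devices[1:]):
        a.right, b.left = b, a
    host_ops = len(devices)          # one persistent launch per device, nothing after
    # The loop below stands for work done inside the persistent kernels.
    for _ in range(iterations):
        halos = [d.read_halos() for d in devices]   # device-initiated peer reads
        for d, (l, r) in zip(devices, halos):       # after a device-side barrier
            d.chunk = step(d.chunk, l, r)
    return [d.chunk for d in devices], host_ops
```

Both variants compute the same result. In this toy model the host acts `iterations * (devices + 1)` times in the host-centric version, but only once per device in the CPU-free version, regardless of the iteration count.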
The execution model's objectives are to enhance programmer productivity, achieve high program performance, and eliminate host control. To this end, BeyondMoore aims to provide a rich set of programming abstractions that hide the complexities of heterogeneous programming and make it easier for programmers to use the diverse hardware resources available. A critical aspect of BeyondMoore is the development of a heterogeneity-centric compiler, which performs static code transformations and data movement optimizations tailored to complex heterogeneous systems. By optimizing code at compile time, it improves performance and efficiency on diverse hardware architectures. BeyondMoore also develops a low-overhead runtime system that maximizes the utilization of heterogeneous resources. This runtime system extracts parallelism, schedules computation, and orchestrates data movement to fully exploit the capabilities of diverse hardware accelerators. To guide performance optimizations and monitor data movement effectively, BeyondMoore devises a comprehensive communication modeling tool called Snoopie. The tool models the cost of data movement and computation, providing insight into resource utilization and performance bottlenecks. Lastly, BeyondMoore aims to demonstrate the effectiveness of its framework on important real-life applications spanning multiple domains. By showcasing its capabilities in practical scenarios, BeyondMoore validates its approach and highlights its potential impact across various fields.
BeyondMoore represents a forward-looking and ambitious endeavor. It builds upon previous successes, such as the utilization of CPU+GPU systems for Exascale computing, and now sets its sights on more tightly coupled heterogeneous systems for the Post-Moore era. By addressing the software challenges inherent in heterogeneous computing, the success of BeyondMoore could ensure continued computing progress beyond Moore's Law, benefiting science and technology.
Moving forward, our efforts are geared towards enhancing CPU-free computing capabilities through the development of a comprehensive toolchain. We are actively crafting a compiler to translate high-level Python code into efficient CPU-free device code, integrating GPU-initiated communication libraries for streamlined development workflows. Our prototype is set to undergo review at the upcoming Code Generation and Optimization Conference in May 2024.
Additionally, we have implemented FreeGraph, a lightweight runtime system tailored for CPU-free task graph execution in multi-device systems. FreeGraph minimizes CPU involvement and scales to multiple GPUs, laying a solid foundation for further advancements. Furthermore, we are designing an API for a unified communication library to optimize device-to-device communication within the CPU-free model. We are also compiling a review of GPU-centric communication approaches for submission to ACM Computing Surveys by June 2024.
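As an illustration of task graph execution without a central scheduler, here is a minimal Python sketch. It is not FreeGraph's API or implementation. The function `run_task_graph` and its arguments are hypothetical, and the dependency-counter scheme (each finishing task decrements its successors' counters and releases those that reach zero) is one common technique assumed here for illustration.

```python
from collections import deque


def run_task_graph(tasks, deps):
    """Run tasks in dependency order.

    tasks: dict mapping task name -> zero-argument callable
    deps:  dict mapping task name -> list of prerequisite task names
    Returns the order in which tasks were executed.
    """
    # Number of unfinished prerequisites per task.
    remaining = {name: len(deps.get(name, [])) for name in tasks}
    successors = {name: [] for name in tasks}
    for name, prerequisites in deps.items():
        for p in prerequisites:
            successors[p].append(name)

    ready = deque(name for name in tasks if remaining[name] == 0)
    order = []
    while ready:
        name = ready.popleft()
        tasks[name]()
        order.append(name)
        # The finishing task signals its successors directly; a successor
        # becomes runnable as soon as its counter reaches zero.
        for s in successors[name]:
            remaining[s] -= 1
            if remaining[s] == 0:
                ready.append(s)

    if len(order) != len(tasks):
        raise ValueError("task graph contains a cycle")
    return order
```

In a multi-GPU setting the counters would presumably live in device memory and be updated by the devices themselves, so that no host thread has to sit between one task finishing and the next starting. That mapping is an assumption, not a description of FreeGraph.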
Moreover, our development of Snoopie, a multi-GPU communication profiling tool, addresses the critical need for comprehensive profiling tools for multi-device applications. Snoopie promises to significantly improve the efficiency of multi-device code development and debugging. A paper describing the tool is currently under review for publication in the proceedings of the International Conference on Supercomputing (ICS’24). Additionally, our project has conducted a comprehensive study benchmarking the event sampling features of prominent hardware platforms, published in IEEE Transactions on Parallel and Distributed Systems, which provides valuable insights for hardware designers and profiling tool developers.
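The kind of summary a multi-GPU communication profiler can produce may be sketched as follows. This is a hypothetical Python example, not Snoopie's implementation or output format. It assumes transfer events are available as `(source, destination, bytes)` tuples, and the function names are invented for illustration.

```python
def communication_matrix(events, num_devices):
    """Aggregate (src, dst, nbytes) transfer events into a device-by-device matrix."""
    matrix = [[0] * num_devices for _ in range(num_devices)]
    for src, dst, nbytes in events:
        matrix[src][dst] += nbytes
    return matrix


def hottest_link(matrix):
    """Return the (src, dst) pair of distinct devices that moved the most bytes."""
    n = len(matrix)
    pairs = [(s, d) for s in range(n) for d in range(n) if s != d]
    return max(pairs, key=lambda p: matrix[p[0]][p[1]])
```

A matrix of this form makes imbalanced or unexpectedly heavy device-to-device traffic visible at a glance, which is one way to locate communication bottlenecks.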
Currently, supercomputers like Frontier, the world's fastest, heavily rely on GPUs for computing power, with CPUs playing a minor role. However, the CPU-free model proposes a shift where CPUs don't need to be as powerful, leading to a more modular supercomputer design. By the end of the project, we aim to provide an open-source framework with a compiler and runtime system for the CPU-free model.
Another achievement is the creation of the Snoopie tool, which helps analyze and monitor data movement in high-performance computing and deep learning tasks. While Snoopie doesn't replace existing tools, it addresses important gaps in the toolkit by focusing on GPU performance. It has garnered attention from major companies like Facebook and Nvidia, demonstrating its value in improving code efficiency and identifying communication issues. We plan to release the source code for Snoopie under an open-source license, making it available to the wider research community.