Advanced Programming Environments for Exascale Data-Centric Computing

Periodic Reporting for period 1 - DataLocAbstractions (Advanced Programming Environments for Exascale Data-Centric Computing)

Reporting period: 2015-04-01 to 2017-03-31

Parallel programming environments have been designed to optimize arithmetic operations because, for more than 30 years, computation was the most expensive component of a computing system. Technology trends in computer architecture break this assumption: moving data has become the most costly component of computing in terms of both power consumption and performance. Moreover, 240-core chips already exist today, and within 5 years 1000-core chips are projected to be available in Exascale systems (next-generation high performance computing systems), which means a further increase in on-chip parallelism. These changes have created an urgent need to redesign programming environments because none of the prevalent programming languages provides interfaces to support data locality.

Dr. Unat proposed simple programming abstractions for data locality and created advanced programming environments that encapsulate these abstractions to enable the transition of codes from a compute-centric to a data-centric paradigm. The project provides a rich interface to express parallelism and data locality through tiling, data layout and loop traversal abstractions. These abstractions were applied to adaptive mesh refinement applications and enabled them to scale on multicore architectures.

In the first objective, Dr. Unat designed the new programming abstractions and a tiling-based programming model that hides the details of data decomposition, cache locality and memory affinity management from the programmer. In the second objective, we developed an accompanying runtime system to support the programming abstractions in distributed memory environments. The runtime system leverages the metadata embedded in the programming model to enable runtime optimizations that would otherwise not be possible. In the third objective, we reached our ultimate goal and integrated the abstractions and runtime system into real-life scientific applications. We tested and demonstrated the performance, productivity, and scalability of the proposed programming environments on applications of varying complexity. Finally, the success of the project led to further grants from the national funding agency and to six publications in prestigious conferences and journals.

Manual data locality management in parallel applications is not a trivial task and greatly complicates application development. Tiling is a well-known method for decomposing data that provides both data locality and parallelism in an application. This project has designed and developed a tiling-based programming model and its runtime system to address data locality issues in emerging multicore and heterogeneous architectures.

The first main contribution of the project is the tiling-based programming model, called TiDA, which provides a multi-language library interface to express parallelism and data locality on multicore architectures with a handful of simple programming abstractions. The effectiveness of the library is shown with several structured grid applications, including an advanced combustion proxy application and a geometric multi-grid solver. The library achieves up to a 2.10x speedup over OpenMP on a single compute node for simple kernels, and up to a 22x improvement over a single thread for a more complex combustion application (SMC) on 24 cores. Lastly, the library has been integrated into the BoxLib Adaptive Mesh Refinement Framework.
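The idea of tiling as a loop traversal abstraction can be illustrated with a minimal sketch. It is not the TiDA interface: the names `make_tiles` and `jacobi_sweep_tiled` and the 5-point stencil are assumptions chosen for illustration. The grid is partitioned into logical tiles (index ranges only, no data is copied), and the sweep visits the grid one tile at a time so that each tile's working set can stay in cache.

```python
from typing import List, Tuple

# A logical tile is just an index range: (row_lo, row_hi, col_lo, col_hi),
# with exclusive upper bounds. Hypothetical layout, not the TiDA data structure.
Tile = Tuple[int, int, int, int]


def make_tiles(n_rows: int, n_cols: int, tile_rows: int, tile_cols: int) -> List[Tile]:
    """Partition an n_rows x n_cols index space into tiles without copying data."""
    tiles = []
    for r in range(0, n_rows, tile_rows):
        for c in range(0, n_cols, tile_cols):
            # Clamp the last tile in each direction to the grid boundary.
            tiles.append((r, min(r + tile_rows, n_rows), c, min(c + tile_cols, n_cols)))
    return tiles


def jacobi_sweep_tiled(src: List[List[float]], dst: List[List[float]], tiles: List[Tile]) -> None:
    """Apply one 5-point Jacobi sweep to interior points, tile by tile."""
    n_rows, n_cols = len(src), len(src[0])
    for r_lo, r_hi, c_lo, c_hi in tiles:
        # Skip the outermost grid boundary, which has no full neighbourhood.
        for i in range(max(r_lo, 1), min(r_hi, n_rows - 1)):
            for j in range(max(c_lo, 1), min(c_hi, n_cols - 1)):
                dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j]
                                    + src[i][j - 1] + src[i][j + 1])
```

Because the tiles are independent within one sweep, each tile could also be handed to a different thread; the sketch keeps the traversal serial to stay short.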

The second main contribution of the project is a uniform programming interface for homogeneous (e.g. CPUs) and heterogeneous processors (e.g. CPUs + GPUs). We extended the concept of tiling in TiDA to overlap data transfers with the execution of tiles on the GPU and thereby hide communication overhead. Decomposing data into tiles also enables the programmer to run applications on the GPU even when the application data does not fit entirely in GPU memory. The performance results show that the library achieves performance comparable to CUDA while greatly simplifying heterogeneous programming.
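The overlap of transfers with tile execution can be sketched as a double-buffered pipeline. The following is a host-only simulation under stated assumptions, not the TiDA-acc implementation: `transfer` and `compute` are hypothetical stand-ins for a host-to-device copy and a kernel launch, and a background thread plays the role of the copy engine. At most two tiles are staged at any time, which is also why the full data set never needs to fit in device memory.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Any, Callable, List, Sequence


def process_tiles_overlapped(
    tiles: Sequence[Any],
    transfer: Callable[[Any], Any],   # hypothetical: copies one tile to the device
    compute: Callable[[Any], Any],    # hypothetical: runs the kernel on a staged tile
) -> List[Any]:
    """Process tiles in order while the next tile is transferred in the background."""
    results: List[Any] = []
    if not tiles:
        return results
    with ThreadPoolExecutor(max_workers=1) as copier:
        pending = copier.submit(transfer, tiles[0])  # start the first copy
        for k in range(len(tiles)):
            staged = pending.result()  # wait until tile k has arrived
            if k + 1 < len(tiles):
                # Start copying tile k+1 before computing on tile k.
                pending = copier.submit(transfer, tiles[k + 1])
            results.append(compute(staged))  # overlaps with the copy above
    return results
```

For example, `process_tiles_overlapped([[1, 2], [3, 4], [5]], list, sum)` treats `list` as the copy and `sum` as the kernel. On a real GPU the same structure would typically be expressed with asynchronous copies and streams.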

The third main contribution of the project is the development of an asynchronous runtime system that uses tiles as its unit of scheduling. The runtime system leverages application metadata (the tiling abstractions) to enable challenging runtime optimizations. Our task graph-based runtime system requires only modest programming effort because it is aware of TiDA and uses tile dependencies to construct the task dependency graph. Experimental results with different applications on up to 24K processors show that the runtime provides a 1.44x speedup over the synchronous code variant.
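How tile dependencies can yield a task graph is sketched below. This is an illustration under assumptions, not the project's runtime: tiles are laid out in one dimension, a task `(step, tile)` depends on the same tile and its immediate neighbours from the previous step, and the function names are hypothetical. The standard-library `graphlib.TopologicalSorter` (Python 3.9 or later) stands in for the scheduler's ready queue.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set, Tuple

Task = Tuple[int, int]  # (time step, tile index)


def build_task_graph(n_tiles: int, n_steps: int) -> Dict[Task, Set[Task]]:
    """Map each task to the set of tasks that must finish before it can run."""
    graph: Dict[Task, Set[Task]] = {}
    for s in range(n_steps):
        for t in range(n_tiles):
            deps: Set[Task] = set()
            if s > 0:
                # Assumed 1D stencil: a tile needs itself and both neighbours
                # from the previous step.
                for n in (t - 1, t, t + 1):
                    if 0 <= n < n_tiles:
                        deps.add((s - 1, n))
            graph[(s, t)] = deps
    return graph


def run_ready_order(graph: Dict[Task, Set[Task]]) -> List[Task]:
    """Return one execution order in which every task follows its dependencies."""
    sorter = TopologicalSorter(graph)
    sorter.prepare()
    order: List[Task] = []
    while sorter.is_active():
        for task in sorter.get_ready():
            order.append(task)   # a real runtime would dispatch to a worker here
            sorter.done(task)    # marking completion may release dependent tasks
    return order
```

In this formulation a task becomes eligible as soon as its own dependencies are done, so a scheduler that dispatches ready tasks to workers has no need for a global barrier between time steps. The serial loop above only records a valid order.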

Dr. Unat, the researcher in charge, has disseminated the research outcomes in various forms.

• Six research articles have been published in high-quality journals [TPDS’17][SIAM’16] and at prestigious conferences [ICPP’17][EuroPar’17][SC’16][ISC’16].
• The software developed in the project has been released as open source.
• With the help of the travel grant provided by the fellowship, Dr. Unat has given numerous talks on the progress of the project around the world (Turkey, Japan, Germany, France, USA, Saudi Arabia, Switzerland, Norway, and England).
• To increase awareness of data locality issues and raise interest in HPC in Turkey, Dr. Unat has organized scientific venues such as the PADAL workshop series and a national High Performance Computing Conference (BASARIM).
• Two graduate students took part in the project under the supervision of Dr. Unat. Dr. Unat also taught a Parallel Computing course at the host institution and incorporated HPC and data locality concepts into the curriculum.

The European Commission established Exascale computing as one of the top scientific grand challenges for the coming decade in its 2012 report “High-Performance Computing (HPC): Europe’s Place in a Global Race”. Exascale can be defined as the next generation of HPC systems, capable of exaflop performance (10^18 operations per second). The report states emphatically that the European scientific community urgently needs to prepare for Exascale computing by 2020.

In preparation for Exascale, one of the main challenges is data movement, which is projected to require more energy than computation even for short on-chip distances (e.g. 5 mm). This shift makes data locality the new programming challenge, as current programming environments lack support for managing data locality in the application. This project provides programming abstractions to utilize on-chip parallelism and to manage vertical and horizontal data movement between the multiple levels of the memory hierarchy. These abstractions better reflect the realities of current and future platforms and facilitate the transition of Adaptive Mesh Refinement (AMR) applications to Exascale. Indeed, the TiDA library has been integrated into BoxLib, one of the well-known AMR frameworks.

With the help of the project, Dr. Unat has been the main driver of the PADAL Workshop Series (Programming Abstractions for Data Locality), which brings together researchers from all around the world to discuss programming abstractions for data locality. These workshops have increased awareness of the problem, and a joint journal article has been published in IEEE Transactions on Parallel and Distributed Systems (TPDS), one of the prestigious journals in the field.
Execution model of TiDA-acc combines GPU and CPU execution.
Locality-aware asynchronous runtime design for TiDA
A grid is divided into regions and then logical tiles for managing data locality