Managing data locality by hand in parallel applications is nontrivial and greatly complicates application development. Tiling is a well-known method for decomposing data that provides both data locality and parallelism in an application. This project designed and developed a tiling-based programming model and its runtime system to address data locality issues on emerging multicore and heterogeneous architectures.
The first main contribution of the project is the tiling-based programming model, called TiDA, which provides a multi-language library interface for expressing parallelism and data locality on multicore architectures through a handful of simple programming abstractions. The effectiveness of the library is demonstrated with several structured grid applications, including an advanced combustion proxy application and a geometric multigrid solver. The library achieves up to 2.10x speedup over OpenMP on a single compute node for simple kernels, and up to 22x improvement over a single thread for a more complex combustion application (SMC) on 24 cores. Lastly, the library has been integrated into the BoxLib Adaptive Mesh Refinement framework.
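The tiling idea can be illustrated with a minimal sketch. This is not the TiDA interface; the function names (`make_tiles`, `scale_tile`, `scale_grid_tiled`) and the simple scaling kernel are assumptions made only for illustration. The grid is split into rectangular tiles, and the kernel is applied one tile at a time, so each tile is a compact working set and independent tiles could be handed to different threads.

```python
def make_tiles(n_rows, n_cols, tile_rows, tile_cols):
    """Return half-open tile bounds (r0, r1, c0, c1) covering an n_rows x n_cols grid."""
    tiles = []
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            r1 = min(r0 + tile_rows, n_rows)  # clip the last tile in each dimension
            c1 = min(c0 + tile_cols, n_cols)
            tiles.append((r0, r1, c0, c1))
    return tiles


def scale_tile(grid, tile, factor):
    """Hypothetical kernel: multiply every cell inside one tile by factor."""
    r0, r1, c0, c1 = tile
    for r in range(r0, r1):
        row = grid[r]
        for c in range(c0, c1):
            row[c] *= factor


def scale_grid_tiled(grid, tile_rows, tile_cols, factor):
    """Apply the kernel tile by tile; tiles do not overlap, so they are independent."""
    tiles = make_tiles(len(grid), len(grid[0]), tile_rows, tile_cols)
    for tile in tiles:
        scale_tile(grid, tile, factor)
    return tiles
```

Because the tiles partition the grid without overlap, the loop over tiles is the natural place to introduce parallelism, while the loops inside `scale_tile` touch only a small, contiguous region.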
The second main contribution of the project is a uniform programming interface for homogeneous (e.g. CPUs) and heterogeneous (e.g. CPUs + GPUs) processors. We extended the concept of tiling in TiDA to overlap data transfers with the execution of tiles on the GPU and thereby hide communication overhead. Decomposing data into tiles also enables the programmer to run applications on the GPU even when the application data cannot fit entirely in GPU memory. The performance results show that the library achieves performance comparable to CUDA while greatly simplifying heterogeneous programming.
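The overlap of transfers and execution can be sketched as a simple prefetching loop. The sketch below is a CPU-only analogy, not TiDA or CUDA code: `load_tile` stands in for a host-to-device copy, `compute_tile` stands in for a GPU kernel, and all names are hypothetical. It shows only the structure of the pipeline (while tile i is being computed, the copy of tile i+1 is already in flight, and only a couple of tiles are resident at once), not real performance behavior.

```python
from concurrent.futures import ThreadPoolExecutor


def load_tile(data, start, size):
    """Stand-in for a host-to-device copy: return a private copy of one tile."""
    return list(data[start:start + size])


def compute_tile(tile):
    """Stand-in for a GPU kernel: square every element of the tile."""
    return [x * x for x in tile]


def process_streamed(data, tile_size):
    """Process data tile by tile, prefetching the next tile during computation."""
    result = []
    starts = list(range(0, len(data), tile_size))
    if not starts:
        return result
    with ThreadPoolExecutor(max_workers=1) as copier:
        # Start the transfer of the first tile.
        pending = copier.submit(load_tile, data, starts[0], tile_size)
        for next_start in starts[1:] + [None]:
            tile = pending.result()  # wait until the current tile has arrived
            if next_start is not None:
                # Issue the next transfer before computing on the current tile.
                pending = copier.submit(load_tile, data, next_start, tile_size)
            result.extend(compute_tile(tile))
    return result
```

At any moment only the tile being computed and the tile being prefetched are held in the stand-in "device" buffers, which mirrors how tiling lets a data set larger than GPU memory be processed in pieces.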
The third main contribution of the project is the development of an asynchronous runtime system that uses tiles as its unit of scheduling. The runtime system leverages application metadata (the tiling abstractions) to enable otherwise challenging runtime optimizations. Our task-graph-based runtime system requires only modest programming effort because it is aware of TiDA and uses tile dependencies to construct the task dependency graph. Experimental results with different applications on up to 24K processors show that the runtime provides a 1.44x speedup over the synchronous code variant.
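How tile dependencies can yield a task graph is sketched below under simplifying assumptions that are not taken from the project: a one-dimensional row of tiles, a nearest-neighbor stencil, and hypothetical function names (`build_task_graph`, `schedule`). Each task is a (step, tile) pair that depends on the same tile and its neighbors from the previous step, and a task becomes ready as soon as those specific predecessors finish, without a global barrier between steps.

```python
from collections import deque


def build_task_graph(n_tiles, n_steps):
    """Map each task (step, tile) to the set of tasks it depends on."""
    deps = {}
    for step in range(n_steps):
        for tile in range(n_tiles):
            if step == 0:
                deps[(step, tile)] = set()
            else:
                # A tile needs itself and its existing neighbors from the previous step.
                neighbours = [t for t in (tile - 1, tile, tile + 1) if 0 <= t < n_tiles]
                deps[(step, tile)] = {(step - 1, t) for t in neighbours}
    return deps


def schedule(deps):
    """Return one valid execution order by releasing tasks as dependencies complete."""
    remaining = {task: len(parents) for task, parents in deps.items()}
    dependents = {task: [] for task in deps}
    for task, parents in deps.items():
        for parent in parents:
            dependents[parent].append(task)
    ready = deque(task for task, count in remaining.items() if count == 0)
    order = []
    while ready:
        task = ready.popleft()
        order.append(task)  # a real runtime would execute the tile here
        for child in dependents[task]:
            remaining[child] -= 1
            if remaining[child] == 0:
                ready.append(child)
    return order
```

In this sketch the programmer supplies only the tile layout and the stencil shape; the dependency graph follows from them, which is the sense in which tiling metadata can reduce the effort of using a task-based runtime.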
Dr. Unat, the researcher in charge, has disseminated the research outcomes in various forms.
• Six research articles have been published in high-quality journals [TPDS’17][SIAM’16] and prestigious conferences [ICPP’17][EuroPar’17][SC’16][ISC’16].
• The software developed in the project has been released as open source.
• With the help of the travel grant provided by the fellowship, Dr. Unat has given numerous talks on the progress of the project around the world (Turkey, Japan, Germany, France, USA, Saudi Arabia, Switzerland, Norway, and England).
• To increase awareness of data locality issues and raise interest in HPC in Turkey, Dr. Unat has organized scientific events such as the PADAL workshop series and a national High Performance Computing Conference (BASARIM).
• Two graduate students took part in the project under the supervision of Dr. Unat. Dr. Unat also taught a Parallel Computing course at the host institution and incorporated HPC and data locality concepts into the curriculum.