Taming Irregular Computations On Heterogeneous processors

Considerable effort has gone into developing general-purpose homogeneous processors that are both fast and energy-efficient for any algorithm. Unfortunately, the existence of accelerators indicates that all attempts thus far at building such a processor have failed. Heterogeneous processors are equipping more and more modern supercomputers. For example, the current top three supercomputers (i.e. Sunway TaihuLight, Tianhe-2, and Titan) are all built on heterogeneous architectures (i.e. SW26010, Xeon Phi and GPU, respectively). Despite this progress in hardware infrastructure, the utilization of heterogeneous computing is still unsatisfactory in practice. For example, a recent report showed that only 16.2% of the jobs on the supercomputer Titan ran on its GPUs, a small proportion compared with the 65.4% of jobs that still used only its CPUs. Utilization suffers mainly from two bottlenecks: (P1) distributed memories (i.e. system memory on the host side and device memory on the accelerator side) raise the cost of software engineering (data transfers must be managed explicitly) and degrade performance (because of frequent low-speed data movement), and (P2) there is a lack of heterogeneity-oriented scalable approaches (able to use both latency-oriented heavyweight cores and throughput-oriented lightweight cores) for irregular problems (such as graph and sparse matrix processing, which often require unpredictable memory accesses, divergent control structures, and fine-grained synchronization and communication).
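To make the notion of an irregular problem concrete, the following is a minimal sketch of sparse matrix-vector multiplication over the widely used compressed sparse row (CSR) layout. It is an illustration only, not code from the project; the function name `spmv_csr` and the small example matrix are assumptions chosen for this sketch.

```python
def spmv_csr(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        # Rows hold different numbers of nonzeros, so per-row work is unbalanced.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read through col_idx: an indirect, data-dependent access.
            y[row] += values[k] * x[col_idx[k]]
    return y


# Hypothetical 3x3 example: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
y = spmv_csr(row_ptr, col_idx, values, [1.0, 2.0, 3.0])
```

The varying inner loop length and the indirect read of `x` are the kind of divergent control flow and unpredictable memory access named under (P2).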

The objective of the project Taming Irregular Computations On Heterogeneous processors (TICOH) is to address the currently unsatisfactory utilization of heterogeneous computing for irregular problems such as graph and sparse matrix processing. Following a multi-level approach that bridges the domains of performance measurement, benchmark data analysis, modeling, data structure construction, algorithm design and application integration, TICOH will explore best practices for achieving the best performance for irregular computations on the most suitable hardware. Specifically, the main focus of the project will be to (a) identify and understand the bottlenecks of current heterogeneous computing (e.g. latency and bandwidth of synchronization and communication in heterogeneity-aware parallel kernels), (b) benchmark and model heterogeneous processors composed of CPU, GPU and high-bandwidth memories (e.g. AMD Bristol Ridge, Intel Skylake and NVIDIA Tegra), (c) design and evaluate new data structures and algorithms for irregular problems that aim to make full use of the computing and memory resources provided by heterogeneous processors, and (d) integrate and apply the newly designed approaches in high-level applications (e.g. scientific software, graph databases and sparse convolutional neural networks). By empirically investigating these issues, the ultimate goal of the project is to allow a broad range of real-world applications to benefit further from heterogeneous hardware in the new era.

TICOH is the acronym for the project entitled "Taming Irregular Computations On Heterogeneous processors", funded under the EU H2020 Marie Skłodowska-Curie Actions (MSCA) Individual Fellowships (IF). The individual research, carried out in liaison with the host organization, the Norwegian University of Science and Technology (NTNU), lasts for two years (2017-2019).

Heterogeneous processors are equipping more and more modern supercomputers. Unfortunately, despite the progress in hardware infrastructure, the utilization of heterogeneous computing is still relatively low in practice. The objective of the project TICOH is to address the currently unsatisfactory utilization of heterogeneous computing for irregular problems such as graph and sparse matrix processing. Achieving this requires a multi-level approach to finding best practices for achieving the best performance for irregular computations on the most suitable hardware.

The project was terminated at the end of M17 rather than M24. The reason is that the Fellow, Weifeng Liu, received an offer from his home university, China University of Petroleum, Beijing. Since the university offered the Fellow an attractive position as a Full Professor in Computer Science and Dean of the College of Information Science and Engineering, he decided to terminate his contract with NTNU and the MSCA project "TICOH" amicably. As a result, this report for M01-M17 is the final report of the project.

The Fellow constructed a benchmark suite containing sparse matrix and graph problems (D2.1-M02), using around 1000 matrices/graphs for benchmarking, and collected a rich set of experimental data (D2.2-M05). Most of the data have been published in a joint paper, "Exploring and analyzing the real impact of modern on-package memory on HPC scientific kernels" (SC '17), and presented in the Fellow's talk "Scalability Analysis of Sparse Matrix Computations on Many-core Processors" (Sparse Days '17 etc.). A performance model (D2.3-M09) named the "stepping model" was presented in the same SC '17 paper, which was nominated for a best paper award at the SC '17 conference. Another execution model (D2.3-M09), named "Warp-Consolidation", has also been developed and published as the paper "Warp-Consolidation: A Novel Execution Model for GPUs" at the ICS '18 conference.

The Fellow also developed several parallel algorithms for sparse matrix multiplication (D3.2-M17) and published the papers "Register-based Implementation of the Sparse General Matrix-matrix Multiplication on GPUs" at PPoPP '18 and "Register-Aware Optimizations for Parallel Sparse Matrix-Matrix Multiplication" in the journal IJPP. For parallel sparse triangular solve (D3.3-M21), the Fellow published the papers "Fast Synchronization-Free Algorithms for Parallel Sparse Triangular Solves with Multiple Right-Hand Sides" in the journal CCPE and "swSpTRSV: A Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architectures" at PPoPP '18. The Fellow also researched the depth-first search algorithm (D3.1-M13) but has not yet implemented efficient code or published a paper on it.
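As background on why sparse triangular solve is hard to parallelize, the sketch below uses the classic level-scheduling idea: rows are grouped into levels so that every row in a level depends only on rows in earlier levels. This is a textbook baseline, not the synchronization-free or Sunway-specific algorithms of the papers cited above; the function names and the example matrix are assumptions made for illustration, and the matrix is assumed to be lower triangular in CSR form with a nonzero diagonal.

```python
def level_sets(row_ptr, col_idx):
    """Group rows of a lower-triangular CSR matrix into dependency levels."""
    n = len(row_ptr) - 1
    level = [0] * n
    for i in range(n):
        for k in range(row_ptr[i], row_ptr[i + 1]):
            j = col_idx[k]
            if j < i:  # x[i] depends on x[j]
                level[i] = max(level[i], level[j] + 1)
    sets = [[] for _ in range(max(level, default=-1) + 1)]
    for i, lv in enumerate(level):
        sets[lv].append(i)
    return sets


def sptrsv_levels(row_ptr, col_idx, values, b):
    """Solve L x = b level by level; rows within a level are independent."""
    x = [0.0] * len(b)
    for rows in level_sets(row_ptr, col_idx):
        for i in rows:  # iterations of this loop could run in parallel
            acc, diag = b[i], None
            for k in range(row_ptr[i], row_ptr[i + 1]):
                j = col_idx[k]
                if j == i:
                    diag = values[k]
                else:
                    acc -= values[k] * x[j]
            x[i] = acc / diag
    return x


# Hypothetical example: L = [[2,0,0,0], [0,1,0,0], [1,0,4,0], [0,3,1,2]]
row_ptr = [0, 1, 2, 4, 7]
col_idx = [0, 1, 0, 2, 1, 2, 3]
values = [2.0, 1.0, 1.0, 4.0, 3.0, 1.0, 2.0]
levels = level_sets(row_ptr, col_idx)
x = sptrsv_levels(row_ptr, col_idx, values, [2.0, 2.0, 13.0, 17.0])
```

In this example only the first level contains more than one row, which shows how quickly the available parallelism can shrink and why the barrier between levels becomes a cost worth avoiding.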

To summarize, from M01 to M17 the Fellow published a total of nine technical papers and gave 13 invited talks (six at conferences/workshops and seven at institutions; D5.2-M03, M06, M07, M10, M12, M14) with the support of the MSCA TICOH project. The website of the TICOH project went online in July 2017 (D5.1-M01). The Fellow also co-organized two minisymposia at international conferences, served as a technical program committee member for four international conferences and two workshops, and reviewed for a number of international journals.

Periodic Reporting for period 1 - TICOH (Taming Irregular Computations On Heterogeneous processors)
