Considerable effort has gone into developing general-purpose homogeneous processors that are both fast and energy-efficient for any algorithm. Unfortunately, the existence of accelerators indicates that all attempts thus far at building such a processor have failed [1]. Modern supercomputers are increasingly equipped with heterogeneous processors. For example, the current top three supercomputers [2] (Sunway TaihuLight, Tianhe-2, and Titan) are all built on heterogeneous architectures (SW26010, Xeon Phi, and GPU, respectively). Despite this progress in hardware infrastructure, the utilization of heterogeneous computing remains unsatisfactory in practice. For example, a recent report [3] showed that only 16.2% of the jobs on the supercomputer Titan ran on its GPUs, a small proportion compared with the 65.4% of jobs that still used only its CPUs. Utilization mainly suffers from two bottlenecks: (P1) distributed memories (i.e. system memory on the host side and device memory on the accelerator side) raise software engineering costs (data transfers must be managed explicitly) and degrade performance (because of frequent, slow data movement); and (P2) there is a lack of heterogeneity-oriented scalable approaches (able to use both latency-oriented heavyweight cores and throughput-oriented lightweight cores) for irregular problems (such as graph and sparse matrix processing, which often require unpredictable memory access, divergent control structures, and fine-grained synchronization and communication).
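To make the notion of an irregular problem in (P2) concrete, the following is a minimal sketch of sparse matrix-vector multiplication over a matrix stored in compressed sparse row (CSR) form. It is only an illustration, not a method of the project: the function name `csr_spmv` and the small example matrix are assumptions introduced here. The read `x[col_idx[k]]` depends on the data, so its access pattern cannot be predicted in advance, and the inner loop length varies per row, which is what leads to divergent control flow on throughput-oriented cores.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a sparse matrix A stored in CSR form."""
    num_rows = len(row_ptr) - 1
    y = [0.0] * num_rows
    for i in range(num_rows):
        # Row i owns the nonzeros in positions row_ptr[i] .. row_ptr[i+1]-1.
        # The trip count differs from row to row (load imbalance).
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # Indirect, data-dependent read of x: the irregular access.
            y[i] += values[k] * x[col_idx[k]]
    return y


# Hypothetical 3x4 example matrix:
# [[1, 0, 0, 2],
#  [0, 0, 0, 0],
#  [0, 3, 4, 5]]
row_ptr = [0, 2, 2, 5]
col_idx = [0, 3, 1, 2, 3]
values = [1.0, 2.0, 3.0, 4.0, 5.0]
x = [1.0, 1.0, 1.0, 1.0]

y = csr_spmv(row_ptr, col_idx, values, x)
```

In this example the three rows contain two, zero, and three nonzeros respectively, so even a tiny matrix already shows the uneven per-row work that makes such kernels hard to map onto both heavyweight and lightweight cores.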
The objective of the project Taming Irregular Computations On Heterogeneous processors (TICOH) is to address the currently unsatisfactory utilization of heterogeneous computing for irregular problems such as graph and sparse matrix processing. Following a multi-level approach that bridges the domains of performance measurement, benchmark data analysis, modeling, data structure construction, algorithm design, and application integration, TICOH will explore best practices that lead toward the best performance for irregular computations on the most suitable hardware. Specifically, the main focus of the project will be to (a) identify and understand the bottlenecks of current heterogeneous computing (e.g. latency and bandwidth of synchronization and communication in heterogeneity-aware parallel kernels), (b) benchmark and model heterogeneous processors composed of CPU, GPU, and high-bandwidth memories (e.g. AMD Bristol Ridge, Intel Skylake, and NVIDIA Tegra), (c) design and evaluate new data structures and algorithms for irregular problems that aim to fully use the computing and memory resources provided by heterogeneous processors, and (d) integrate and apply the newly designed approaches in high-level applications (e.g. scientific software, graph databases, and sparse convolutional neural networks). By empirically investigating these issues, the ultimate goal of the project is to allow a broad range of real-world applications to benefit further from heterogeneous hardware.