
Load Slice Core: A Power and Cost-Efficient Microarchitecture for the Future

Periodic Reporting for period 2 - Load Slice Core (Load Slice Core: A Power and Cost-Efficient Microarchitecture for the Future)

Reporting period: 2019-07-01 to 2020-12-31

While improving computer system performance has always been important, continuing to do so under stringent power constraints (i.e., dark silicon) is increasingly challenging, yet critical for many emerging applications. Modern superscalar (out-of-order) processors deliver high performance but incur high design complexity, high power consumption and large chip area. Low-power (in-order) cores, on the other hand, are inherently more power- and cost-efficient, but their major disadvantage is limited performance. The ideal processor microarchitecture delivers high performance at low power and low cost, which might enable new applications in both the server and mobile/embedded spaces. This project explores and proposes enhancements at the core level as well as at the chip level to improve performance in a power- and cost-efficient way. The project contributes in four areas: (1) accelerating single-thread performance, (2) performance and power prediction, (3) scheduling for chip-level performance, and (4) workload-specific optimization.
The project made various contributions along four research avenues. (1) We propose microarchitecture enhancements to improve single-thread performance, including Probabilistic Branch Support [MICRO 2018], Precise Runahead Execution [HPCA 2020] and Forward Slice Core [PACT 2020]. (2) We propose the RPPM performance model for multicore CPUs [ISPASS 2019] and the MDM model for GPUs [MICRO 2020]. In addition, the HSM slowdown model enables fair and QoS-aware resource management in multitasking GPUs [ASPLOS 2020]. (3) We propose write-rationing garbage collection to manage hybrid memories for CPUs [PLDI 2018, SIGMETRICS 2019]. We also propose PAE address mapping to balance memory utilization in GPUs [ISCA 2018] and adaptive memory-side last-level caching to trade off bandwidth versus capacity in GPUs [ISCA 2019, MICRO 2020]. (4) We extend the multicore simulator Sniper to the ARM instruction-set architecture [ISPASS 2019], we develop an emulation platform for evaluating future hybrid memory systems on existing commodity hardware [ISPASS 2019], and we develop a rigorous benchmarking methodology for Python workloads [IISWC 2020].
The project advances the state of the art in a number of ways. (1) We are the first to provide microarchitecture support for probabilistic computation; we are the first to observe that there are unused processor resources during runahead execution; and we are the first to explore forward-slice core microarchitectures as opposed to the previously proposed backward-slice core microarchitectures. (2) We advance the state of the art in CPU and GPU modeling by expanding the scope and improving both accuracy and modeling speed. We further demonstrate that hybrid mechanistic/empirical modeling is key for accurate and effective GPU slowdown prediction. (3) We demonstrate that randomized address mapping maximizes GPU memory bandwidth utilization, and we propose adaptive memory-side caching, which trades off cache bandwidth for capacity. (4) We propose for the first time an automated simulator validation and fine-tuning methodology, which we successfully apply to the ARM version of Sniper. We debunk current practice in Python benchmarking and propose a rigorous methodology in its place.