Load Slice Core: A Power and Cost-Efficient Microarchitecture for the Future

Project information

Load Slice Core

Grant agreement ID: 741097

DOI

10.3030/741097

Closed project

EC signature date 8 May 2017

Start date 1 January 2018

End date 31 December 2023

Funded under

EXCELLENT SCIENCE - European Research Council (ERC)

Total cost

€ 2 499 500,00

EU contribution

€ 2 499 500,00

Coordinated by

UNIVERSITEIT GENT
Belgium

Periodic Reporting for period 4 - Load Slice Core (Load Slice Core: A Power and Cost-Efficient Microarchitecture for the Future)

Reporting period: 2022-07-01 to 2023-12-31

While improving computer system performance has always been important, continuing to do so under stringent power constraints (i.e., dark silicon) is increasingly challenging, yet critical for many emerging applications. Modern superscalar (out-of-order) processors deliver high performance but incur high design complexity, high power consumption, and large chip area. Low-power (in-order) cores, on the other hand, are inherently more power- and cost-efficient, but their major disadvantage is limited performance. The ideal processor microarchitecture delivers high performance at low power and low cost, which could enable new applications in both the server and mobile/embedded spaces. This project explored and proposed various enhancements at both the core level and the chip level to improve performance in a power- and cost-efficient way. The project made contributions to (1) accelerating single-thread performance, (2) performance and power prediction, (3) chip-level performance scheduling, and (4) workload-specific optimization, some of which were recognized with prestigious awards.

The project made various contributions along four research avenues. (1) We proposed microarchitecture enhancements to improve single-thread performance, including Probabilistic Branch Support [MICRO 2018], Precise Runahead Execution [HPCA 2020], Forward Slice Core [PACT 2020], and Vector Runahead [ISCA 2021, MICRO 2023]. (2) We proposed time-proportional performance analysis, namely TIP [MICRO 2021] and TEA [ISCA 2023], as well as the RPPM performance model for multicore CPUs [ISPASS 2019], the MDM model for GPUs [MICRO 2020], and scale-model simulation [CAL 2021, ISPASS 2022, HPCA 2024]. The HSM slowdown model enables fair and QoS-aware resource management in multitasking GPUs [ASPLOS 2020]. (3) We proposed write-rationing garbage collection to manage hybrid memories for CPUs [PLDI 2018, SIGMETRICS 2019]. We also proposed PAE address mapping to balance memory utilization in GPUs [ISCA 2018], adaptive memory-side last-level caching to trade off bandwidth versus capacity in GPUs [ISCA 2019, MICRO 2020], resource management in heterogeneous systems-on-chip [HPCA 2022], and caching in multi-chip GPU systems [ISCA 2023]. (4) We extended the multicore simulator Sniper to the ARM instruction-set architecture [ISPASS 2019], and we developed an emulation platform for evaluating future hybrid memory systems on existing commodity hardware [ISPASS 2019]. We also developed a rigorous benchmarking methodology for Python workloads [IISWC 2020], representative GPU workloads [IISWC 2021], and a GPU simulation methodology [ISPASS 2023].

The project advanced the state of the art in a number of ways. (1) We were the first to provide microarchitecture support for probabilistic computation; the first to observe that there are unused processor resources during runahead execution; the first to explore forward-slice core microarchitectures, as opposed to the previously proposed backward-slice core microarchitectures; and the first to propose vector runahead techniques to extract massive memory-level parallelism from challenging graph analytics workloads with chains of dependent memory accesses. (2) We advanced the state of the art in CPU and GPU modeling by expanding the scope and improving both accuracy and modeling speed. We further demonstrated that hybrid mechanistic/empirical modeling is key for accurate and effective GPU slowdown prediction. We proposed a novel performance prediction methodology called scale-model simulation, which predicts performance on large-scale systems through the simulation of smaller-scale miniature system configurations. (3) We demonstrated that randomized address mapping maximizes GPU memory bandwidth utilization; we proposed adaptive memory-side caching, which trades off cache bandwidth for capacity; and we proposed an effective cache hierarchy organization for multi-chip GPU systems. (4) We proposed, for the first time, an automated simulator validation and fine-tuning methodology, which we successfully applied to the ARM version of Sniper. We debunked current practice in Python benchmarking and proposed a rigorous methodology. We also developed a sound benchmarking methodology for long-running GPU-compute workloads.
