Extending Coherence for Hardware-Driven Optimizations in Multicore Architectures

Periodic Reporting for period 2 - ECHO (Extending Coherence for Hardware-Driven Optimizations in Multicore Architectures)

Reporting period: 2021-03-01 to 2022-08-31

Multicore processors are present nowadays in most digital devices, from smartphones to high-performance servers. The increasing computational power of these processors is essential for enabling many important emerging application domains such as big-data, media, medical, or scientific modeling.

A fundamental technique to improve computer performance is speculation. This technique consists of executing work ahead of time, before it is known whether it is actually needed or even correct. If the speculation is right, important performance gains are obtained. However, if the work executed speculatively is not useful, a large penalty can be paid. For example, in hardware, speculation comes with a high cost to track the speculative state of the execution in case of misprediction, significantly increases energy consumption by performing operations that may later be discarded, and introduces important security vulnerabilities, as shown by the Meltdown and Spectre attacks in 2018. In software (e.g. compilers), speculation is not enabled by default, thus limiting optimizations and leading to sub-optimal performance.

ECHO "Extending Coherence for Hardware-Driven Optimizations in Multicore Architectures" aims to remove the inefficiencies of speculation at both the hardware and software levels, boosting the performance and energy efficiency of future computers. The idea behind ECHO for improving speculative execution is to change the future in computers, that is, to alter the events that happen in a computer such that they follow the initial prediction. This way, the work performed ahead of time does not need to be restarted. As Abraham Lincoln once said:

"The best way to predict your future is to create it."

A key question in ECHO is: what if we could make speculative execution always succeed? Then the execution would not be speculative anymore, and computers would get all the advantages of speculation without its cost, achieving more efficient run-time execution and enabling compile-time speculative optimizations, since they would always be correct (they would not actually be speculative anymore).

The overall objectives of ECHO are:

1. ECHO-core: Efficient design of processing cores.

2. ECHO-sync: Efficient synchronization via lock elision and transactional memory.

3. ECHO-comp: Speculative compiler optimizations with hardware support.

4. ECHO-htrg: Efficient and easy-to-program heterogeneous systems.

ECHO-core.

In the front-end of the core, we developed the Entangling instruction prefetcher, a novel prefetching technique that won the 1st Instruction Prefetching Championship, 2020 [IEEE CAL'20]. Prefetching is another example of changing the future of events in a computer. We showed that, at a very low cost, Entangling brings a 10% speedup on a large set of applications [ISCA'21].

In the back-end, we developed a selective and aggressive prefetcher that virtually removes store-buffer-related stalls (95.0%) [Store Prefetch Bursts, MICRO'20]. We also proposed a speculative solution to enable store-to-load forwarding transparent to the programmer [Store atomicity, MICRO'20]. This solution is a first step towards an efficient implementation of Sequential Consistency. As a continuation, and in synergy with ECHO-htrg, we enabled inter-thread store-to-load forwarding [ITSLF, MICRO'21], allowing fast data communication between threads and resulting in a 12% speedup for communication-intensive applications. ITSLF was awarded an Honorable Mention at Micro TopPicks 2022 (i.e. among the 24 most relevant computer architecture papers in 2021). We removed the fences surrounding atomic operations [Free Atomics, ISCA'22, best paper session], resulting in a 25% performance improvement when running 32 threads. Our prefetcher, BLUE, won the 1st ML-Based Data Prefetching Competition, and a follow-up data prefetcher [Berti, MICRO'22] improved performance by 3.5% over the state-of-the-art.
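
To make the term concrete, the following is a minimal sketch of conventional, single-thread store-to-load forwarding, in which a load is served from a pending store in the store buffer instead of waiting for memory. It is a toy software model under simplifying assumptions, not the mechanism of the cited papers; the class `ToyStoreBuffer` and its method names are hypothetical.

```python
from collections import deque


class ToyStoreBuffer:
    """Toy FIFO store buffer in front of a dict-backed memory (illustrative only)."""

    def __init__(self, memory):
        self.memory = memory      # committed state: address -> value
        self.pending = deque()    # stores not yet written to memory, oldest first

    def store(self, addr, value):
        # A store is buffered and does not update memory immediately.
        self.pending.append((addr, value))

    def load(self, addr):
        # Search youngest-to-oldest so the load observes the latest pending store.
        for pending_addr, value in reversed(self.pending):
            if pending_addr == addr:
                return value, "forwarded"
        # No matching pending store: read committed memory (0 if never written).
        return self.memory.get(addr, 0), "memory"

    def drain_one(self):
        # Commit the oldest pending store, preserving store order.
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value
```

In this model a load that hits in the buffer never touches memory. The inter-thread variant mentioned above would additionally let another thread observe a pending store, which this single-buffer sketch does not attempt to model.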

ECHO-sync.

While analyzing programs' critical sections, we found room for improvement in the applications by using modern programming constructs. We proposed Splash-4 [ISPASS'21, IISWC'22], which reduces the execution time by 48% compared to its previous version, Splash-3. Then, targeting medium-size critical sections, we proposed MAD atomics [MICRO'21], which achieve non-speculative, non-deadlocking and concurrent execution of critical sections, thus improving performance by 2.7 times for a set of applications and concurrent data structures over an Intel RTM-like design. In parallel and in synergy with ECHO-core, we proposed delaying stores at the store buffer in hardware transactional memory [DeTraS, IEEE TPDS'22], which brings speedups of 25% for the STAMP benchmarks.

ECHO-comp.

We proposed Regional Out of Order Writes [ROOW, PACT'20], which shows that the store buffer limitations can also be addressed with a compiler that delimits safe regions of code in which stores can be reordered without breaking consistency. We also developed a compiler approach to non-speculative execution that removes important security vulnerabilities, and showed that we can reduce the performance gap to an unsafe baseline by 53% (on average). Recently, we have reduced conflict misses in hardware transactional memory using a prefetching mechanism directed by the compiler [SUPE'22]. In synergy with ECHO-core, we proposed a fusion mechanism for non-contiguous instructions [MICRO'22], able to improve performance over state-of-the-art fusion by 7%.

ECHO-htrg.

Our efficient coherence-based strict persistency work [TSOPER, HPCA'21] leverages the cache coherence protocol to provide ordering of writes in persistent memory without needing the programmer or compiler to be concerned about false sharing, data-race-free semantics, etc. We also explored efficient SIMD instructions with compiler support [TPDS'22] by compacting and restoring data used by vector operations, offering a speedup of 29% for a set of applications with predicated execution.
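
The compact-and-restore idea for predicated vector operations can be illustrated with a small software analogy. This is only a sketch under assumptions: the actual hardware and compiler mechanism is not described here, and the helper names `compact`, `restore` and `predicated_apply` are hypothetical.

```python
def compact(values, mask):
    """Gather the active lanes (mask is True) into a dense list and record their positions."""
    positions = [i for i, active in enumerate(mask) if active]
    return [values[i] for i in positions], positions


def restore(original, dense_results, positions):
    """Scatter dense results back to their original lanes; inactive lanes keep their old values."""
    out = list(original)
    for pos, result in zip(positions, dense_results):
        out[pos] = result
    return out


def predicated_apply(values, mask, op):
    """Apply op only to active lanes by compacting, operating densely, then restoring."""
    dense, positions = compact(values, mask)
    # Stands in for a vector operation whose lanes are all doing useful work.
    dense_results = [op(v) for v in dense]
    return restore(values, dense_results, positions)
```

For example, `predicated_apply([1, 2, 3, 4], [True, False, True, False], lambda v: v * 10)` operates on only two dense elements and leaves the masked-off lanes untouched.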

Significant progress and novel contributions have been proposed and documented in 19 publications, many of them in flagship conferences (2 @ ISCA, 6 @ MICRO, 1 @ HPCA, 2 @ PACT) and journals (2 @ IEEE TPDS). We have proposed solutions to improve core performance while maintaining strong consistency models. We have been the first to propose inter-thread store-to-load forwarding. Our prefetching mechanisms (code publicly available) are the best-performing among prefetching techniques whose code has been released, both for instructions and data. We have been the first to reach a non-speculative fine-grain execution of small critical sections, and we are currently working towards larger critical sections. Our compiler techniques have proven instrumental in improving computer performance through software-hardware co-design. We have designed secure and efficient cores and dealt with the problem of persisting data in non-volatile memories in an efficient manner. We removed the fences of atomic operations while maintaining correctness and allowed store-to-load forwarding across them. We enabled non-consecutive instruction fusion.

Important goals remain to be achieved during the next years of the project: finding non-speculative solutions for strong consistency models and for hardware transactional memory with large critical sections, proposing compiler optimizations for further performance improvements, and efficiently managing heterogeneous systems.

Figure: ECHO overview (overview-echo.png)