Periodic Reporting for period 1 - ECHO (Extending Coherence for Hardware-Driven Optimizations in Multicore Architectures)
Reporting period: 2019-09-01 to 2021-02-28
A fundamental technique to improve computer performance is speculation. This technique consists of executing work ahead of time, before it is known whether it is actually needed or even correct. If the speculation is right, important performance gains are obtained. However, if the work executed speculatively is not useful, a large penalty can be paid. For example, in hardware, speculation comes with a high cost to track the speculative state of the execution in case of a misprediction, significantly increases energy consumption by performing operations that may later be discarded, and brings important security vulnerabilities, as shown by the Meltdown and Spectre attacks at the beginning of 2018. In software (e.g. compilers), speculation is not enabled by default, thus limiting optimizations and leading to sub-optimal performance.
The ERC Consolidator Grant project ECHO, "Extending Coherence for Hardware-Driven Optimizations in Multicore Architectures", aims to remove the inefficiencies of speculation at both the hardware and software levels, boosting the performance and energy efficiency of future computers. Since performance in current multicore processors is limited by their power budget, it is imperative to make multicore processors as energy-efficient as possible to increase performance even further.
The motivation behind ECHO's approach to improving speculative execution is to change the future of the computation, that is, to alter the events that happen in a computer so that they follow the initial prediction. This way, the work performed ahead of time does not need to be re-started. As Abraham Lincoln once said:
"The best way to predict your future is to create it."
A key question in ECHO is: what if we could make speculative execution always succeed? Then the execution would not be speculative anymore, and computers would get all the advantages of speculation but without its cost, achieving more efficient run-time execution and enabling compile-time speculative optimizations, since they would always be correct (they would not actually be speculative anymore).
Current computers use several memory (cache) levels in order to quickly access the data that the processing units need. Lower levels, closer to the processing units, are smaller and faster than higher levels. When a processor accesses data, the ideal scenario is that the data are present in the lowest level, guaranteeing fast access to the information and therefore high performance.
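The latency gap between cache levels can be illustrated with a minimal sketch. The model below is purely hypothetical (sizes, latencies, and class names are illustrative, not the ECHO simulator): a hit in the small, fast first level costs far fewer cycles than falling through to the larger, slower second level or to main memory.

```python
# Hypothetical two-level cache model: L1 hits are fast, L2 hits slower,
# and misses to main memory slowest of all (latencies are illustrative).

class TwoLevelCache:
    def __init__(self, l1_size=4, l2_size=16):
        self.l1 = []              # small, fast level (LRU order, most recent last)
        self.l2 = []              # larger, slower level
        self.l1_size = l1_size
        self.l2_size = l2_size

    def access(self, addr):
        """Return the access latency in cycles for this simplified model."""
        if addr in self.l1:
            self.l1.remove(addr)
            self.l1.append(addr)
            return 1              # L1 hit: fastest
        if addr in self.l2:
            self.l2.remove(addr)
            latency = 10          # L2 hit: slower
        else:
            latency = 100         # miss: fetch from main memory
        self._fill(addr)
        return latency

    def _fill(self, addr):
        self.l1.append(addr)
        if len(self.l1) > self.l1_size:
            victim = self.l1.pop(0)   # evict LRU line from L1 into L2
            if victim not in self.l2:
                self.l2.append(victim)
            if len(self.l2) > self.l2_size:
                self.l2.pop(0)

cache = TwoLevelCache()
first = cache.access(0x40)    # cold miss: must go to main memory
second = cache.access(0x40)   # now resident in L1: fast hit
```

The two accesses to the same address show the point: the first pays the full memory latency, the second finds the data in the lowest level.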
A key technique to ensure that requested data are found in the lowest cache level is to bring (prefetch) data predicted to be used in the near future into that cache level. Prefetching is another example of changing the future: an access that would not find its data in the lowest cache level will indeed find it there if the memory system places it there in advance of the actual access.
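The idea can be sketched with the simplest possible prefetcher, a sequential (next-line) one. This is a toy assumption for illustration only, not the project's technique: whenever a line is accessed, the following line is brought in ahead of time, so a later sequential access finds its data already in the cache.

```python
# Hypothetical sketch: a next-line prefetcher on top of a tiny cache model.
# Accessing line N also fills line N+1, turning future sequential misses
# into hits.

class PrefetchingCache:
    def __init__(self):
        self.lines = set()

    def access(self, line):
        hit = line in self.lines
        self.lines.add(line)       # demand fill on a miss
        self.lines.add(line + 1)   # prefetch the predicted next line
        return hit

cache = PrefetchingCache()
hits = [cache.access(line) for line in range(4)]   # sequential walk: 0,1,2,3
```

Only the very first access of the sequential walk misses; every subsequent one hits because the prefetcher changed the future of the cache contents.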
The first outcome of ECHO is a novel prefetching technique, named Entangling Prefetchers. In an analogy to quantum entanglement, Entangling Prefetchers pair two pieces of data such that an access to one by the processor implies an access to the other in the near future (Figure 1). Entanglement is done by carefully measuring the time the data need to be brought into the lowest cache level, and once the first piece of data is accessed by the processor, the second one is also requested from the memory system so that it can be placed in the lowest cache level in time.
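The entangling idea described above can be sketched as follows. All names, structures, and the latency constant are illustrative assumptions, not the actual hardware design: the prefetcher learns pairs of the form "when X is accessed, Y will be needed soon", choosing as the source of the pair an access that happened far enough in the past that the prefetch of Y would have arrived on time.

```python
# Hypothetical sketch of an entangling prefetcher: on a miss, entangle the
# missing line with an earlier access issued at least MISS_LATENCY cycles
# before, so that next time the source is seen, the destination can be
# prefetched early enough to hide the miss.

MISS_LATENCY = 3  # cycles a prefetch needs to complete (assumed value)

class EntanglingPrefetcher:
    def __init__(self):
        self.entangled = {}   # source line -> entangled destination line
        self.history = []     # recent (time, line) accesses

    def train(self, time, line):
        # Pair this line with the most recent access old enough for a
        # prefetch issued then to have completed by now.
        for t, src in reversed(self.history):
            if time - t >= MISS_LATENCY:
                self.entangled[src] = line
                break
        self.history.append((time, line))

    def on_access(self, line):
        # Return the entangled destination to prefetch, if any.
        return self.entangled.get(line)

pf = EntanglingPrefetcher()
pf.train(0, 'A')               # observe A at cycle 0
pf.train(5, 'B')               # B misses; entangle A -> B (A is old enough)
prefetch = pf.on_access('A')   # next time A is seen, B is prefetched early
```

The timeliness measurement is the key design point: entangling with too recent a source would issue the prefetch too late to hide the miss.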
An Entangling Prefetcher for Instructions recently won the first Instruction Prefetching Championship (2020), which belongs to a series of prestigious championships in computer architecture. The Entangling Prefetcher obtained a 29.5% performance improvement over a computer that does not implement instruction prefetching. An energy- and area-efficient version of the Entangling Instruction Prefetcher, obtaining over a 28% performance improvement, has been published in the peer-reviewed journal IEEE Computer Architecture Letters, 2020. Using a more advanced baseline model and at a very low cost, we show a 10% improvement when using the Entangling Instruction Prefetcher over a large (~1000) set of applications, in our publication at the International Symposium on Computer Architecture, 2021.
ECHO is also working on further techniques:
* Store Prefetch Bursts
* Store Atomicity
* Regional Out of Order Writes
These remain important goals to be achieved during the next years of the project.