Periodic Reporting for period 1 - Berti-Chip (Energy-Efficient Highly Accurate Data Prefetching)
Reporting period: 2024-05-01 to 2025-10-31
Berti is a data prefetcher developed within the ERC Consolidator Grant ECHO. The prefetcher sits at the first-level data cache (L1D), a placement that favors both timeliness and accuracy. Initial simulation results show that Berti can boost processor performance by 33% (with respect to processors not using prefetching mechanisms) and by 8.5% (when compared to mainstream prefetchers) while providing accuracy above 90%, which translates into a low energy overhead for the memory hierarchy. In addition, Berti is a cost-effective prefetcher that requires just 2.55 KB of storage, which is little compared with the typical 48 KB of L1D cache storage (for data only).
The key objective of this project is to develop a hardware design for a Berti-like data prefetcher and test it in a real processor, going beyond previous tests in academic simulators. A key characteristic of Berti is the use of deltas (the distance in cache lines from the current access to a set of previous accesses), instead of strides (the distance in cache lines from the current access to the strictly previous access). In essence, Berti learns local deltas, that is, deltas obtained by tracking previous accesses from the same instruction. We strongly believe that the use of local deltas can offer a notable boost in processor performance and efficiency, and that its design simplicity makes this kind of prefetcher a serious candidate both for the emerging low-power edge market and for high-performance computers.
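The difference between strides and local deltas can be illustrated with a minimal Python sketch. It is not the Berti hardware design: the function name `strides_and_deltas`, the history depth `HISTORY`, and the trace format of `(ip, line)` pairs are assumptions made only for this illustration, and the stride is computed per instruction for easier comparison.

```python
from collections import defaultdict, deque

HISTORY = 4  # hypothetical number of previous accesses remembered per instruction


def strides_and_deltas(accesses, history=HISTORY):
    """Compute, for each access, its stride and its local deltas.

    accesses: iterable of (ip, line) pairs, where ip identifies the
    instruction and line is a cache-line address (illustrative format).
    Returns a list of (ip, line, stride, deltas) tuples.
    """
    past = defaultdict(lambda: deque(maxlen=history))  # per-instruction history
    result = []
    for ip, line in accesses:
        prev = past[ip]
        # Stride: distance to the strictly previous access of this instruction.
        stride = line - prev[-1] if prev else None
        # Deltas: distances to every remembered previous access of this instruction.
        deltas = [line - p for p in prev]
        result.append((ip, line, stride, deltas))
        prev.append(line)
    return result


trace = [(0x400, 10), (0x400, 12), (0x404, 100), (0x400, 15)]
for ip, line, stride, deltas in strides_and_deltas(trace):
    print(hex(ip), line, stride, deltas)
```

For the last access of instruction `0x400` the stride is 3, while the deltas are 5 and 3: a delta-based prefetcher has more than one candidate distance to choose from for each access.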
* Implementation of a local delta prefetcher in hardware: The next step was to implement a prefetcher in hardware using Verilog. For this, it was important to understand the cache interface offered to the prefetcher. Some processors offer richer interfaces, while others are more limited. In any case, these interfaces usually provide less information than academic simulators. Therefore, before implementing, we needed to select the target processor so that we could accommodate its interface or extend it if necessary. After discussions with companies and academic partners, we chose the Sargantana core, developed by the Barcelona Supercomputing Center under the European Processor Initiative. However, Sargantana's L1D interface was quite limited, and extending it was not straightforward. Thus, we adapted our prefetcher to their interface instead of extending the interface itself. The resulting prefetcher was named DeltaRegion. It does not use the instruction pointer (IP) to obtain local deltas; instead, it uses nearby memory accesses as proxies for local behavior.
* Testing and refinement of the prefetcher in a real processor: After designing and implementing our prefetcher, we tested it on the Sargantana processor and began a process of fine-tuning to improve performance by identifying suboptimal design choices. After this process, our prefetcher outperformed Sargantana's native prefetcher by around 9% when running the typical workloads used by the Barcelona Supercomputing Center to test the Sargantana core.
* Business model analysis: We held meetings with industry partners and conducted a market survey to identify potential customers for our product. Industry has shown interest in our prefetcher. The most promising path for technology transfer identified during this process is to work with industry partners to shape our prefetching methodology to their needs and to license our products to them.
* A simplified and more performant version of Berti. This result has been documented in the following publication: Agustín Navarro-Torres, Biswabandan Panda, Jesús Alastruey-Benedé, Pablo Ibáñez, Víctor Viñals-Yúfera, Alberto Ros, "A Complexity-Effective Local Delta Prefetcher". IEEE Transactions on Computers (TC), vol. 74 (5), pages 1482–1494, January 2025.
* The Verilog code for a simplified local-delta prefetcher and its testing in the Sargantana core, achieving significant performance improvements. This result has been submitted for consideration to a top conference.
* The business plan and the conclusions drawn regarding the next actions to take to achieve the successful transfer of our technology to industry.
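The idea behind DeltaRegion mentioned above, using nearby memory accesses instead of the instruction pointer to approximate local behavior, can be pictured with a short Python sketch. The actual DeltaRegion design is not described in this summary, so everything here is an assumption for illustration only: the function name `region_deltas`, the region size given by `REGION_BITS`, and the history depth `HISTORY` are all hypothetical.

```python
from collections import defaultdict, deque

REGION_BITS = 6  # hypothetical: 2**6 = 64 cache lines per region
HISTORY = 4      # hypothetical number of previous accesses remembered per region


def region_deltas(lines, region_bits=REGION_BITS, history=HISTORY):
    """Compute deltas among accesses that fall in the same memory region.

    lines: iterable of cache-line addresses; no instruction pointer is needed.
    Returns a list of (region, deltas) tuples, one per access.
    """
    past = defaultdict(lambda: deque(maxlen=history))  # per-region history
    result = []
    for line in lines:
        region = line >> region_bits  # nearby lines share a region
        prev = past[region]
        # Deltas to the remembered previous accesses in the same region.
        result.append((region, [line - p for p in prev]))
        prev.append(line)
    return result


for region, deltas in region_deltas([64, 200, 66, 203, 69]):
    print(region, deltas)
```

In this sketch, two interleaved access streams that touch different regions keep separate histories, so their deltas do not mix even though the instruction that issued each access is unknown.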