CORDIS - EU research results

Coordination and Composability: The Keys to Efficient Memory System Design

Periodic Reporting for period 4 - CC-MEM (Coordination and Composability: The Keys to Efficient Memory System Design)

Reporting period: 2021-09-01 to 2022-08-31

Problem: Data movement in complex computer systems is difficult to manage, and the result is lost energy efficiency and lost performance.
Importance: A significant share of computing energy goes into moving data. Because computer systems are power-limited (by batteries on mobile devices and by cooling on all devices), reducing the energy spent on moving data allows us to increase performance, battery life, or both.
Overall objectives: Improve data movement efficiency by coordinating data movement across the different parts of the system.
The project has addressed memory system efficiency in three areas: 1) the interaction between instructions and scheduling, 2) complex memory systems, and 3) traditional memory systems.

Interaction between instructions and scheduling: We have analyzed the behavior of memory instructions and their interactions with other instructions in the processor. This has given us insights into how to efficiently construct hardware schedulers that allow instructions to execute nearly as well as they do under expensive out-of-order schedulers, at far lower cost. The results are significantly increased scheduling efficiency and decreased complexity.

Complex memory systems: We have analyzed applications that execute on many distributed processors (both graphics workloads on GPUs and large-scale NUMA workloads) to determine how best to optimize memory system behavior. In both cases we found that combining knowledge of the hardware and the software allowed us to significantly improve performance, but doing so required clever techniques to explore and understand how to configure the applications and the hardware.

Traditional memory systems: We have analyzed the interaction between memory requests and the existing processor pipeline and identified that we can take advantage of existing structures in the processor to improve efficiency with essentially no overhead. This has allowed us to transform both the store buffer and the register file into caches, thereby significantly reducing the energy spent accessing the first-level cache. In addition to working within the processor, we have improved the interactions between the processor and the OS through the virtual memory system. This has resulted in improvements to the allocation of large pages in fragmented systems and a re-design of the 40-year-old choices still used in today's virtual memory paging systems. The re-design is enough of an improvement, and simple enough, that it is being included in the future design of most mobile processors.
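As a rough illustration of why reusing the store buffer as a cache can reduce first-level cache accesses, the toy model below lets loads probe a small store buffer before going to the L1 cache. It is a minimal sketch under stated assumptions, not the project's design: the class name StoreBufferFilter, the buffer size, and the oldest-first drain policy are all hypothetical choices made for the example.

```python
from collections import OrderedDict


class StoreBufferFilter:
    """Toy model (hypothetical): loads probe a small store buffer before L1."""

    def __init__(self, entries=8):
        self.entries = entries        # assumed buffer capacity
        self.buffer = OrderedDict()   # address -> value, oldest entry first
        self.l1 = {}                  # stands in for the L1 data cache
        self.l1_accesses = 0          # proxy for energy spent on L1

    def store(self, addr, value):
        # A newer store to the same address replaces the buffered copy.
        self.buffer.pop(addr, None)
        self.buffer[addr] = value
        if len(self.buffer) > self.entries:
            # The oldest buffered store drains to L1 (one L1 access).
            old_addr, old_value = self.buffer.popitem(last=False)
            self.l1[old_addr] = old_value
            self.l1_accesses += 1

    def load(self, addr):
        if addr in self.buffer:
            # Served from the store buffer, with no L1 access.
            return self.buffer[addr]
        self.l1_accesses += 1
        return self.l1.get(addr, 0)


if __name__ == "__main__":
    sb = StoreBufferFilter(entries=2)
    sb.store(0x10, 1)
    sb.store(0x20, 2)
    print(sb.load(0x10), sb.l1_accesses)  # load served by the buffer
```

In this model every load that hits in the buffer avoids one L1 access, so the benefit depends on how often loads touch recently stored addresses.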
The work described above demonstrates progress beyond the state of the art in power-efficient instruction scheduling, optimization of complex memory systems for both graphics and compute, and improvements in memory system efficiency in traditional systems. We expect that this work will continue and expand its use of memory system metadata to track and optimize system performance.
Complex Memory System Behavior - GCC miss rates over time by cache size with prefetching
Methods for flattening the page table to reduce accesses for large memory systems.
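The effect of flattening on the number of accesses per page-table walk can be illustrated with simple arithmetic. The sketch below assumes a 48-bit virtual address, 4 KiB pages, and a conventional four-level radix table with 9 index bits per level, then merges adjacent levels into two 18-bit levels. These parameters and the function names are assumptions for illustration only; the report does not state the actual configuration.

```python
def walk_accesses(va_bits, page_offset_bits, index_bits_per_level):
    """Return the number of table lookups in one radix page-table walk."""
    translated = va_bits - page_offset_bits
    if sum(index_bits_per_level) != translated:
        raise ValueError("index bits must cover the virtual page number")
    return len(index_bits_per_level)


def split_indices(vaddr, page_offset_bits, index_bits_per_level):
    """Split a virtual address into per-level indices, top level first."""
    vpn = vaddr >> page_offset_bits
    indices = []
    # Peel index fields off the low end of the virtual page number.
    for bits in reversed(index_bits_per_level):
        indices.append(vpn & ((1 << bits) - 1))
        vpn >>= bits
    return list(reversed(indices))


if __name__ == "__main__":
    conventional = [9, 9, 9, 9]   # assumed four-level layout
    flattened = [18, 18]          # assumed: adjacent levels merged pairwise
    print(walk_accesses(48, 12, conventional))
    print(walk_accesses(48, 12, flattened))
```

Under these assumptions a walk drops from four table accesses to two. The trade-off is larger table nodes: an 18-bit index with 8-byte entries needs a 2 MiB node instead of a 4 KiB one.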