Periodic Reporting for period 1 - SHEPHERD (Cross-layer resilience for rack-scale disaggregated memory)
Reporting period: 2021-09-01 to 2023-08-31
Disaggregated memory is a memory-centric architecture that separates compute and memory into distinct nodes within a rack, connected by high-performance interconnects. Compute nodes provide processing power, while memory nodes offer large memory capacity, enabling flexible scaling of processing and memory to meet workload demands. This separation enhances memory utilization and scalability by allowing memory to be shared across multiple compute nodes.
The project’s overall objective is to address the challenge of memory error tolerance in disaggregated systems, where rising bit error rates, driven by memory-cell miniaturization and a growing number of memory components, pose reliability risks. To that end, it develops cross-layer resilience techniques that combine memory protection and replication, with hardware and software working together to efficiently tolerate errors both within individual memory nodes and across nodes in the rack.
Additionally, the project developed a Replication-Aware Memory-Error Protection (RAMP) framework that enables resilience across multiple memory blades. RAMP integrates the lower-tier FlexDIMM protection with an upper memory-replication tier that uses rack-scale replication and erasure coding to correct errors that remain unaddressed by the FlexDIMM module. This tier offers a cost-effective and efficient method to handle memory errors, ensuring high availability with minimal storage overhead. An analytical model was also developed to guide the design of the two-tier memory resilience system, balancing performance, energy, and storage costs.
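The storage trade-off between replication and erasure coding in the upper tier can be illustrated with a minimal sketch. It is not the RAMP implementation: it assumes a simple single-parity XOR code across memory blades, and the names `xor_blocks`, `encode`, `recover`, and `storage_overhead` are hypothetical helpers introduced only for this illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("blocks must have equal length")
    # zip(*blocks) yields one tuple of byte values per byte position.
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def encode(data_blocks):
    """Build a stripe: the data blocks followed by one XOR parity block.

    Assumption: each block of the stripe lives on a different memory blade.
    """
    return list(data_blocks) + [xor_blocks(data_blocks)]


def recover(stripe, lost_index):
    """Rebuild the block at lost_index from the surviving blocks.

    Single parity tolerates exactly one lost block per stripe.
    """
    survivors = [b for i, b in enumerate(stripe) if i != lost_index]
    return xor_blocks(survivors)


def storage_overhead(data_units, redundant_units):
    """Extra storage as a fraction of the useful data stored."""
    return redundant_units / data_units


if __name__ == "__main__":
    stripe = encode([b"\x01\x02", b"\x10\x20", b"\xff\x00"])
    # Simulate losing the block held by blade 1 and rebuilding it.
    assert recover(stripe, 1) == b"\x10\x20"
    print("3-way replication overhead:", storage_overhead(1, 2))
    print("4 data + 1 parity overhead:", storage_overhead(4, 1))
```

Under these assumptions, keeping three full copies costs two extra units of storage per unit of data, while a stripe of four data blocks plus one parity block costs a quarter of a unit. The price is that rebuilding a lost block requires reading every surviving block in the stripe, which is the kind of performance, energy, and storage balance an analytical model would weigh.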
The scientific results of this project are documented in: (1) a technical report on applying the two-tier resilience framework to improve the efficiency of a resilient disaggregated memory system, (2) a paper summarizing the resilience approach and reliability model, along with another paper developing an experimental methodology for addressing performance variation, and (3) papers on improving the energy efficiency of data center servers, developed as part of the training activities.