Cross-layer resilience for rack-scale disaggregated memory

Project information

SHEPHERD

Grant agreement ID: 101029391

DOI

10.3030/101029391

Closed project

EC signature date 29 March 2021

Start date 1 September 2021

End date 30 December 2024

Funded under

EXCELLENT SCIENCE - Marie Skłodowska-Curie Actions

Total cost

€ 145 941,12

EU contribution

€ 145 941,12

Coordinated by

UNIVERSITY OF CYPRUS
Cyprus

Periodic Reporting for period 1 - SHEPHERD (Cross-layer resilience for rack-scale disaggregated memory)

Reporting period: 2021-09-01 to 2023-08-31

The project “Cross-layer resilience for rack-scale disaggregated memory” develops cross-layer resilience techniques to mitigate memory errors in disaggregated memory systems.

Disaggregated memory is a memory-centric architecture that separates compute and memory into distinct nodes within a rack, connected by high-performance interconnects. Compute nodes provide processing power, while memory nodes offer large memory capacity, enabling flexible scaling of processing and memory to meet workload demands. This separation enhances memory utilization and scalability by allowing memory to be shared across multiple compute nodes.

The project’s overall objective is to address the challenge of memory error tolerance in disaggregated systems, where rising bit error rates, caused by memory cell miniaturization and a larger number of memory components, pose reliability risks. To this end, it develops cross-layer resilience techniques that combine memory protection and replication, with hardware and software working together to efficiently tolerate errors both within individual memory nodes and across nodes in the rack.

The project developed cross-layer resilience techniques to tolerate memory errors both within individual memory nodes and across multiple memory nodes in a rack. One of the key innovations was FlexDIMM, a memory module designed to provide configurable protection against random bit-cell errors, particularly in high-density memory technologies such as Resistive RAM (ReRAM). FlexDIMM uses two types of protection codes: a fixed-protection code that opportunistically corrects bit errors, and a configurable code that employs BCH error-correcting codes (ECC) to address more severe errors. This module serves as the lower-tier protection in a two-tier resilience approach, complementing an upper-tier memory-replication scheme that corrects the errors the lower tier cannot. An analytical model was created to assess the reliability of FlexDIMM, predicting its failure rates and helping to refine the design of the overall memory error protection system.
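To illustrate the kind of quantity a reliability model for an ECC-protected module can estimate, the following minimal sketch computes the probability that a codeword contains more bit errors than its code can correct. It is not the project's analytical model. It assumes independent bit errors with a uniform raw bit error rate, and the names `codeword_failure_prob`, `n_bits`, `t_correctable`, and `ber` are hypothetical.

```python
from math import exp, lgamma, log, log1p


def _log_binom_pmf(n: int, k: int, p: float) -> float:
    """Natural log of the binomial probability of exactly k errors in n bits."""
    log_choose = lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
    return log_choose + k * log(p) + (n - k) * log1p(-p)


def codeword_failure_prob(n_bits: int, t_correctable: int, ber: float) -> float:
    """Probability that more than t_correctable of n_bits bits are in error.

    Assumption (not from the source): bit errors are independent and
    identically distributed with raw bit error rate `ber`.
    """
    if ber <= 0.0:
        return 0.0
    if ber >= 1.0:
        return 1.0 if t_correctable < n_bits else 0.0
    # Sum the tail directly so very small probabilities keep their precision.
    tail = sum(
        exp(_log_binom_pmf(n_bits, k, ber))
        for k in range(t_correctable + 1, n_bits + 1)
    )
    return min(1.0, tail)


if __name__ == "__main__":
    # Illustrative values only: a 552-bit codeword at a raw BER of 1e-4.
    for t in (1, 2, 4):
        print(t, codeword_failure_prob(552, t, 1e-4))
```

Raising `t_correctable` lowers the uncorrectable-error probability at the cost of more check bits, which is the trade-off a configurable code exposes.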

Additionally, the project developed a Replication-Aware Memory-Error Protection (RAMP) framework that enables resilience across multiple memory blades. RAMP integrates the lower-tier FlexDIMM protection with an upper memory-replication tier that uses rack-scale replication and erasure coding to correct errors that remain unaddressed by the FlexDIMM module. This tier offers a cost-effective and efficient method to handle memory errors, ensuring high availability with minimal storage overhead. An analytical model was also developed to guide the design of the two-tier memory resilience system, balancing performance, energy, and storage costs.
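The storage trade-off between replication and erasure coding can be sketched with the simplest erasure code, a single XOR parity block. This is an illustration under stated assumptions, not RAMP's actual scheme: the source does not specify which erasure code is used, and all function names here are hypothetical.

```python
from typing import Sequence


def xor_blocks(blocks: Sequence[bytes]) -> bytes:
    """Bytewise XOR of equal-length blocks."""
    if not blocks:
        raise ValueError("need at least one block")
    length = len(blocks[0])
    if any(len(b) != length for b in blocks):
        raise ValueError("all blocks must have the same length")
    out = bytearray(length)
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)


def encode_parity(data_blocks: Sequence[bytes]) -> bytes:
    """Parity block stored on an additional (hypothetical) memory blade."""
    return xor_blocks(data_blocks)


def recover_block(surviving_blocks: Sequence[bytes], parity: bytes) -> bytes:
    """Rebuild the single missing data block from the survivors and parity."""
    return xor_blocks(list(surviving_blocks) + [parity])


def replication_overhead(copies: int) -> float:
    """Extra storage relative to the data when keeping `copies` full copies."""
    return float(copies - 1)


def erasure_overhead(k_data: int, m_parity: int) -> float:
    """Extra storage relative to the data for k data plus m parity blocks."""
    return m_parity / k_data


if __name__ == "__main__":
    data = [b"abcd", b"1234", b"wxyz"]  # one block per memory blade
    parity = encode_parity(data)
    # Simulate losing the second block and rebuilding it.
    rebuilt = recover_block([data[0], data[2]], parity)
    print(rebuilt == data[1])
    print(replication_overhead(2), erasure_overhead(4, 1))
```

Under these assumptions, keeping two full copies costs 100% extra storage, while four data blocks plus one parity block cost 25% extra and still tolerate the loss of any one block.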

The scientific results of this project are documented in: (1) a technical report on applying the two-tier resilience framework to improve the efficiency of a resilient disaggregated memory system, (2) a paper summarizing the resilience approach and reliability model, along with another paper developing an experimental methodology for addressing performance variation, and (3) papers on improving the energy efficiency of data center servers, developed as part of the training activities.

This project advances the state of the art in resilient memory systems by introducing a novel two-tier resilience framework for disaggregated memory architectures. Unlike existing solutions that focus on error correction at a single layer, the proposed approach integrates lower-tier memory protection with upper-tier memory replication, providing resilience across both individual memory nodes and multiple nodes within a rack. This dual-layer strategy enhances fault tolerance while minimizing storage overhead through the efficient use of erasure coding and replication techniques. The project also develops analytical models to guide system design, balancing performance, energy efficiency, and reliability, addressing gaps in current memory error resilience methods. This work lays the foundation for more efficient, fault-tolerant disaggregated memory systems, a critical area of interest with the rise of high-speed, low-latency interconnects like Compute Express Link (CXL).
