DURO: Deep-memory Ubiquity, Reliability and Optimization

Periodic Reporting for period 1 - DURO (DURO: Deep-memory Ubiquity, Reliability and Optimization)

Reporting period: 2016-10-01 to 2018-09-30

Supercomputers are essential to scientific progress. Scientists use them to model a wide variety of phenomena, from the behavior of sub-atomic particles to the formation of galaxies, as well as a large number of industrial applications. Understanding climate change, mental disorders and new materials are just some of the many problems that require extremely large and long simulations on supercomputers. High-performance computing systems have been growing exponentially for decades, yet they still need to increase their computational power by several orders of magnitude in order to run the most ambitious scientific simulations.

However, increased computational power must come together with increased data availability, that is to say, fast data movement between processors and storage. To this end, deep memory hierarchies have emerged, in which several storage devices offer different trade-offs between capacity, latency, bandwidth and resilience, among other properties. Using those deep memory hierarchies efficiently is mandatory to extract the full computational power of future extreme-scale machines.

The objective of this project is to design strategies that simultaneously increase the fault tolerance and the performance of scientific applications on deep memory hierarchies. To that end, it is important to analyze the applications usually running on large supercomputers and extract features that map well onto the memory limitations of those systems.
During the DURO project we performed a deep analysis of a multi-grid application and its storage and computational limitations. We developed an adaptive precision mechanism that dynamically adapts the precision of the datasets used during the computation. The mechanism monitors the error evolution at runtime and switches to another precision once the error falls below a certain threshold.
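The idea of runtime precision switching can be sketched as follows. This is a minimal, hypothetical illustration (a Jacobi iteration on a 2x2 system, with float32 emulated via the standard `struct` module), not the project's actual multi-grid implementation; the function names and thresholds are assumptions for the example.

```python
import struct

def to_f32(v):
    """Emulate float32 storage by rounding a Python float to single precision."""
    return struct.unpack('f', struct.pack('f', v))[0]

def solve_adaptive(a11, a12, a21, a22, b1, b2,
                   tol=1e-12, switch_tol=1e-5, max_iter=10000):
    """Jacobi iteration that starts in (emulated) single precision and
    promotes to double precision once the residual reaches the threshold
    where single precision stops helping. Hypothetical sketch."""
    x1 = x2 = 0.0
    low = True  # currently iterating in low precision
    for it in range(max_iter):
        # One Jacobi sweep; round results to float32 while in low-precision mode.
        n1 = (b1 - a12 * x2) / a11
        n2 = (b2 - a21 * x1) / a22
        if low:
            n1, n2 = to_f32(n1), to_f32(n2)
        x1, x2 = n1, n2
        # The residual is always evaluated in double precision.
        r = max(abs(b1 - a11 * x1 - a12 * x2),
                abs(b2 - a21 * x1 - a22 * x2))
        if low and r < switch_tol:
            low = False  # error below the threshold: switch precision
        if r < tol:
            return x1, x2, it
    return x1, x2, max_iter
```

Iterating in the cheaper precision for as long as it still reduces the error, and only then paying for the wider format, is what yields a faster time to completion without loss of final accuracy.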

We have also analyzed the performance counters of modern hardware devices to perform automatic online steering of data placement. Previous efforts had shown that online monitoring at such fine granularity leads to extremely large overheads, rendering the technique prohibitively expensive. We overcame this issue with a low-level strategy integrated inside a lightweight kernel that avoids most of the monitoring overhead, making the technique usable in real executions.
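The steering logic behind such a scheme can be illustrated with a toy model: per-dataset access counters are sampled over an interval, and the hottest datasets are kept in the fast tier (e.g. HBM) up to its capacity. The class, names and thresholds below are purely illustrative assumptions, not the API of the driver developed in the project.

```python
class TieredPlacement:
    """Toy model of counter-driven data placement across two storage tiers.
    Datasets whose access counters exceed a threshold are candidates for the
    fast tier; the hottest ones, up to capacity, are promoted. Illustrative
    sketch only, not the DURO performance-counter driver."""

    def __init__(self, fast_capacity, hot_threshold):
        self.fast_capacity = fast_capacity
        self.hot_threshold = hot_threshold
        self.counters = {}    # dataset name -> accesses in current interval
        self.fast_tier = set()

    def record_access(self, dataset, n=1):
        """In reality this count would come from hardware counters."""
        self.counters[dataset] = self.counters.get(dataset, 0) + n

    def rebalance(self):
        """Called once per monitoring interval: keep the hottest datasets,
        up to capacity, in the fast tier, then reset the counters."""
        hot = [d for d, c in self.counters.items() if c >= self.hot_threshold]
        hot.sort(key=lambda d: self.counters[d], reverse=True)
        self.fast_tier = set(hot[:self.fast_capacity])
        self.counters = {}
        return self.fast_tier
```

The point the project addresses is that reading real hardware counters at this granularity from user space is normally too expensive; moving the sampling into a lightweight kernel component is what makes an online loop like this affordable.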

Moreover, we studied multiple representative applications and developed algorithmic techniques that allow them to recover from memory errors and data corruption without having to roll back to a previous checkpoint. This technique was implemented and evaluated, demonstrating its efficacy and its potential to reduce resilience overhead. A detailed study of the memory usage of all those applications showed the different reliability requirements of each dataset.
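The principle of forward recovery can be shown with a minimal checksum example in the style of algorithm-based fault tolerance: a redundancy value is kept alongside the data, and a corrupted element is recomputed from it instead of restoring a checkpoint. This is a generic sketch under that assumption, not the specific algorithms developed in the project.

```python
def protect(data):
    """Attach a checksum so a single corrupted element can later be
    rebuilt without rolling back to a checkpoint (ABFT-style sketch)."""
    return list(data), sum(data)

def recover(data, checksum, bad_index):
    """Forward recovery: recompute the corrupted element from the checksum
    and the surviving elements, instead of restoring an old state."""
    survivors = sum(v for i, v in enumerate(data) if i != bad_index)
    data[bad_index] = checksum - survivors
    return data
```

Because only the redundancy must be maintained during failure-free execution, such techniques can cost far less than periodic checkpointing when errors are rare.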

All this work has been published in three different papers presented at workshops of the Supercomputing Conference in Dallas:

MCHPC'18
PMBS'18
FTXS'18

Concerning wider societal dissemination, the following activities were carried out within the DURO project:

1st BSC Hackathon - October 21, 2016
Supercomputer OpenHouse - October 22, 2016
Supercomputer OpenHouse - October 21, 2017
2nd BSC Hackathon - November 3, 2017
BSC Career Day - November 24, 2017
HiPEAC Reliability Tutorial - January 22, 2018
Pint of Science - May 16, 2018
HiPEAC Careers Mentoring - May 24, 2018
Science Festival - June 9, 2018 (photos attached)

Our work on the multi-grid solver was the first adaptive precision technique with dynamic adaptation at runtime, achieving between 15% and 30% improvement in time to completion without loss of accuracy. It was published at the 9th IEEE International Workshop on Performance Modeling, Benchmarking and Simulation of High Performance Computer Systems (PMBS'18).

The custom performance-counter driver developed during this project was implemented and tested with multiple scientific applications running on over a hundred thousand processes, showing less than 10% overhead while capturing data movements across the different storage levels. This research was published at the Workshop on Memory Centric High Performance Computing (MCHPC'18).

Finally, our work on forward recovery from memory errors demonstrated up to a 14% reduction in resilience overhead while imposing a negligible cost on failure-free executions. The research paper was published at the 8th Workshop on Fault Tolerance for HPC at Extreme Scale (FTXS'18). Furthermore, the implementation was integrated into the FTI library and released as open-source software.