Periodic Reporting for period 1 - DURO (DURO: Deep-memory Ubiquity, Reliability and Optimization)
Reporting period: 2016-10-01 to 2018-09-30
Increasing computational power must come together with an increase in data availability, that is, fast data movement between processors and storage. To this end, deep memory hierarchies have emerged, with several storage devices offering different trade-offs between capacity, latency, bandwidth, and resilience, among other factors. Using those deep memory hierarchies efficiently is essential to extracting the full computational power of future extreme-scale machines.
The objective of this project is to design strategies that simultaneously increase the fault tolerance and the performance of scientific applications on deep memory hierarchies. To that end, it is important to analyze the applications that typically run on large supercomputers and extract features that map well onto the memory limitations of those systems.
We have also analyzed the performance counters of modern hardware devices in order to perform automatic online steering of data placement. Previous efforts had shown that online monitoring at such fine granularity leads to extremely large overheads, rendering the technique prohibitively expensive. We overcame this issue with a low-level strategy integrated inside a light kernel, which avoids most of the overheads related to such monitoring and makes the technique viable for real executions.
Moreover, we studied multiple representative applications and developed algorithmic techniques that allow them to recover from memory errors and data corruption without having to roll back to a previous checkpoint. These techniques were implemented and evaluated, demonstrating their efficacy and their potential to reduce resilience overhead. A detailed study of the memory usage of all those applications was performed, showing the different reliability requirements of each dataset.
All of this work was published in three papers presented at workshops of the Supercomputing Conference (SC18) in Dallas:
MCHPC'18: https://passlab.github.io/mchpc/mchpc2018/#program
PMBS'18: https://www.dcs.warwick.ac.uk/pmbs/pmbs/PMBS/Schedule.html
FTXS'18: https://sites.google.com/site/ftxsworkshop/home/ftxs-2018
Concerning wider societal dissemination, the following activities were carried out within the DURO project:
1st BSC Hackathon - October 21, 2016 ( http://hackathon2016.bsc.es/winners )
Supercomputer OpenHouse - October 22, 2016
Supercomputer OpenHouse - October 21, 2017
2nd BSC Hackathon - November 3, 2017 ( http://hackathon2017.bsc.es/winners ) ( https://www.eic.cat/sites/default/files/publicacions/fulls_dels_enginyers_23_novembre_2017.pdf )
BSC Career Day - November 24, 2017
HiPEAC Reliability Tutorial - January 22, 2018 ( https://www.hipeac.net/2018/manchester/schedule/#TUTORI )
Pint of Science - May 16, 2018
HiPEAC Careers Mentoring - May 24, 2018
Science Festival - June 9, 2018 ( Photos attached )
The custom performance counter driver developed during this project was implemented and tested with multiple scientific applications running on over one hundred thousand processes, showing less than 10% overhead while capturing the data movements across the different storage levels. This research was published at the Workshop on Memory Centric High Performance Computing (MCHPC'18).
Finally, our work on forward recovery for memory errors demonstrated up to a 14% reduction in resilience overhead, while imposing a negligible difference on failure-free executions. The research paper was published at the 8th Workshop on Fault Tolerance for HPC at Extreme Scale (FTXS'18). Furthermore, the implementation was integrated into the FTI library and released as open-source software (https://github.com/leobago/fti).