During the DURO project we performed a deep analysis of a multi-grid application and its storage and computational limitations. We developed an adaptive-precision mechanism that dynamically adapts the precision of the datasets used during the computation. The mechanism monitors the evolution of the error at runtime and switches to another precision when the error falls below a certain threshold.
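As an illustration only, the threshold-based switch can be sketched on a toy problem. The code below is not the DURO implementation: the 1D Poisson system, the Jacobi solver, the function names, the threshold values and the direction of the switch (single precision first, then double once the residual is small) are all assumptions made for this example.

```python
from array import array


def residual_norm(u, f):
    """Max-norm residual of -u[i-1] + 2*u[i] - u[i+1] = f[i] with zero boundaries."""
    n = len(u)
    worst = 0.0
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(f[i] - (2.0 * u[i] - left - right)))
    return worst


def jacobi_sweep(u, f):
    """One Jacobi sweep; the result is stored with the same precision as u."""
    n = len(u)
    new = array(u.typecode, u)
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        new[i] = 0.5 * (f[i] + left + right)
    return new


def solve_adaptive(f, switch_threshold=1e-3, tol=1e-10, max_iters=10000):
    """Iterate in single precision, then promote the dataset to double
    precision once the monitored residual drops below switch_threshold
    (hypothetical policy, chosen for illustration)."""
    u = array("f", [0.0] * len(f))  # 'f' = 32-bit storage
    history = []
    for it in range(max_iters):
        err = residual_norm(u, f)
        history.append((it, u.typecode, err))
        if err < tol:
            break
        if u.typecode == "f" and err < switch_threshold:
            u = array("d", u)  # 'd' = 64-bit storage
        u = jacobi_sweep(u, f)
    return u, history


if __name__ == "__main__":
    solution, history = solve_adaptive([1.0] * 8)
    switch_iteration = next(it for it, code, _ in history if code == "d")
    print("switched to double precision at iteration", switch_iteration)
    print("final residual:", history[-1][2])
```

In this sketch the early iterations, where the error is large anyway, use half the storage of the later ones; a production mechanism would also need a policy for choosing the threshold.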
We also analyzed the performance counters of modern hardware devices in order to perform automatic online steering of data placement. Previous efforts had shown that online monitoring at such a fine granularity leads to extremely large overheads, rendering the technique prohibitively expensive. We overcame this issue by using a low-level strategy integrated inside a lightweight kernel that avoids most of the monitoring overhead, making the technique practical for real executions.
Moreover, we studied multiple representative applications and developed algorithmic techniques that allow them to recover from memory errors and data corruption without having to roll back to a previous checkpoint. These techniques were implemented and evaluated, demonstrating their efficacy and their potential to reduce resilience overhead. A detailed study of the memory usage of all these applications was also performed, showing the different reliability requirements of each dataset.
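To make the idea of recovering without a rollback concrete, here is a minimal sketch of forward recovery in an iterative solver. It is not one of the techniques developed in the project: the toy 1D Poisson problem, the simulated fault, the detection rule (non-finite or out-of-bound values) and all function names are assumptions made for this example.

```python
import math


def residual_norm(u, f):
    """Max-norm residual of -u[i-1] + 2*u[i] - u[i+1] = f[i] with zero boundaries."""
    n = len(u)
    worst = 0.0
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        worst = max(worst, abs(f[i] - (2.0 * u[i] - left - right)))
    return worst


def jacobi_sweep(u, f):
    """One Jacobi sweep over the whole vector."""
    n = len(u)
    new = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        new.append(0.5 * (f[i] + left + right))
    return new


def is_suspect(value, bound):
    """Flag values that are non-finite or outside an assumed solution bound."""
    return not math.isfinite(value) or abs(value) > bound


def repair_in_place(u, f, bound):
    """Rebuild suspect entries from their neighbours via the stencil relation."""
    n = len(u)
    repaired = []
    for i in range(n):
        if is_suspect(u[i], bound):
            left = u[i - 1] if i > 0 and not is_suspect(u[i - 1], bound) else 0.0
            right = u[i + 1] if i < n - 1 and not is_suspect(u[i + 1], bound) else 0.0
            u[i] = 0.5 * (f[i] + left + right)
            repaired.append(i)
    return repaired


def solve_with_forward_recovery(f, bound, fault=None, tol=1e-10, max_iters=10000):
    """Jacobi iteration that repairs corrupted entries and keeps going,
    instead of restoring an earlier checkpoint."""
    u = [0.0] * len(f)
    log = []
    for it in range(max_iters):
        if fault is not None and it == fault[0]:
            u[fault[1]] = float("nan")  # simulated memory corruption
        for index in repair_in_place(u, f, bound):
            log.append((it, index))
        u = jacobi_sweep(u, f)
        if residual_norm(u, f) < tol:
            break
    return u, log


if __name__ == "__main__":
    u, log = solve_with_forward_recovery([1.0] * 8, bound=100.0, fault=(50, 3))
    print("repaired (iteration, index):", log)
    print("solution:", [round(v, 6) for v in u])
```

The sketch relies on the fact that a convergent iterative method tolerates a perturbed iterate: the repaired value is only approximate, and the remaining iterations absorb the difference.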
All of this work has been published in three papers presented at the Supercomputing Conference in Dallas:
MCHPC'18: https://passlab.github.io/mchpc/mchpc2018/#program
PMBS'18: https://www.dcs.warwick.ac.uk/pmbs/pmbs/PMBS/Schedule.html
FTXS'18: https://sites.google.com/site/ftxsworkshop/home/ftxs-2018
Concerning wider societal dissemination, the following activities were carried out within the DURO project:
1st BSC Hackathon - October 21, 2016 (http://hackathon2016.bsc.es/winners)
Supercomputer OpenHouse - October 22, 2016
Supercomputer OpenHouse - October 21, 2017
2nd BSC Hackathon - November 3, 2017 (http://hackathon2017.bsc.es/winners, https://www.eic.cat/sites/default/files/publicacions/fulls_dels_enginyers_23_novembre_2017.pdf)
BSC Career Day - November 24, 2017
HiPEAC Reliability Tutorial - January 22, 2018 (https://www.hipeac.net/2018/manchester/schedule/#TUTORI)
Pint of Science - May 16, 2018
HiPEAC Careers Mentoring - May 24, 2018
Science Festival - June 9, 2018 (photos attached)