Periodic Reporting for period 2 - FTHPC (Fault Tolerant High Performance Computing)
Reporting period: 2020-12-01 to 2022-05-31
Existing fault-tolerance solutions are either (i) general purpose, requiring little to no algorithmic effort but severely degrading performance (e.g. checkpoint-restart), or (ii) tailored to specific applications and very efficient, but significantly increasing the programmer's workload.
We propose a novel approach that achieves the best of both worlds: high performance and general-purpose fault resilience. The objectives include: (1) pushing the limits of resilience-enabling building blocks; (2) developing low-overhead algorithmic tools that are as widely applicable as possible; (3) improving usability and exposing opportunities for automation.
We demonstrated that computing over encoded data can serve a dual purpose: mitigating faults on the one hand, and improving resource utilization on the other, including lower computation and communication costs and a reduced memory footprint. We generalized this approach and applied it to several algorithms and implementations.
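To illustrate what computing over encoded data can look like, here is a minimal sketch of the classical checksum-based approach to fault-tolerant matrix multiplication. It is a textbook technique chosen for illustration, not necessarily the scheme developed in the project. The function names (`encode_rows`, `encode_cols`, `locate_and_correct`) are hypothetical, and the sketch assumes integer matrices and at most one corrupted data entry in the product.

```python
# Illustrative sketch only: classical checksum encoding for matrix
# multiplication. All names here are hypothetical, not from the project.

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def encode_rows(a):
    """Append a checksum row holding the sum of each column of a."""
    return a + [[sum(col) for col in zip(*a)]]

def encode_cols(b):
    """Append a checksum column holding the sum of each row of b."""
    return [row + [sum(row)] for row in b]

def locate_and_correct(c):
    """Find and repair one corrupted data entry of an encoded product.

    c has n data rows plus a checksum row, and m data columns plus a
    checksum column. Returns the (row, column) that was repaired, or
    None when every checksum agrees.
    """
    n, m = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n) if sum(c[i][:m]) != c[i][m]]
    bad_cols = [j for j in range(m)
                if sum(c[i][j] for i in range(n)) != c[n][j]]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("more than one fault, or a fault in the checksums")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum tells us by how much the entry is off.
    c[i][j] += c[i][m] - sum(c[i][:m])
    return (i, j)

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]

# Multiply the encoded operands; the product carries its own checksums.
C_full = matmul(encode_rows(A), encode_cols(B))

# Simulate a soft error in one entry of the result, then repair it.
C_full[1][0] = 999
repaired_at = locate_and_correct(C_full)

# Strip the checksum row and column to recover the plain product.
C = [row[:-1] for row in C_full[:-1]]
```

The encoding adds one row and one column to the operands, so the extra work is small relative to the multiplication itself, and a single faulty entry can be repaired without recomputing the product or rolling back to a checkpoint.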