
Fault Tolerant High Performance Computing

Periodic Reporting for period 2 - FTHPC (Fault Tolerant High Performance Computing)

Reporting period: 2020-12-01 to 2022-05-31

The future of super-computing depends on our ability to cope with errors. Increasing machine size and decreasing operating voltage generate soft errors (bit flips) and hard errors (component failure). Hardware trends imply at least two errors per minute on next generation (exascale) supercomputers. If a CPU or GPU breaks down, the computed results must remain trustworthy. The high performance computing and theory of computing communities have been addressing this challenge for more than two decades. Most solutions are either
(i) general purpose, requiring little to no algorithmic effort but severely degrading performance (e.g. checkpoint-restart), or (ii) tailored to specific applications: very efficient, but significantly increasing the programmers' workload.
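To make the general-purpose end of this trade-off concrete, here is a minimal checkpoint-restart sketch in Python. It is our illustration only, not project code: the whole state is periodically serialized so a crashed job can resume, at the cost of repeated I/O and recomputation of any iterations lost since the last checkpoint.

import pickle

CHECKPOINT_FILE = "state.pkl"   # hypothetical file name
CHECKPOINT_EVERY = 100          # iterations between checkpoints

def load_checkpoint():
    try:
        with open(CHECKPOINT_FILE, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        return {"step": 0, "x": 1.0}  # fresh start

def save_checkpoint(state):
    with open(CHECKPOINT_FILE, "wb") as f:
        pickle.dump(state, f)

state = load_checkpoint()
while state["step"] < 10_000:
    # Stand-in for real work: one Newton step toward sqrt(2).
    state["x"] = 0.5 * (state["x"] + 2.0 / state["x"])
    state["step"] += 1
    if state["step"] % CHECKPOINT_EVERY == 0:
        save_checkpoint(state)  # general purpose, but I/O dominates at scale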
We propose a novel approach that achieves the best of both worlds: high performance and general purpose fault resilience. Our objectives are: (1) pushing the limits of resilience-enabling building blocks; (2) developing low-overhead algorithmic tools that are as widely applicable as possible; (3) improving usability and exposing automation opportunities.
We obtained fault tolerant parallel matrix multiplication algorithms that reduce the resource overhead by minimizing both the number of additional processors and the communication costs. We obtained fault resilience at small overhead for Strassen's and other recursive fast matrix multiplication algorithms. We also proposed a new straggler mitigation solution for delay faults and applied it to distributed matrix multiplication.
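For intuition, the following Python sketch illustrates the classical checksum idea behind algorithm-based fault tolerance for matrix multiplication, in the spirit of Huang and Abraham. It is a generic illustration, not the project's algorithm: A is extended with a row of column sums and B with a column of row sums, so the product carries checksums that expose a corrupted entry.

import numpy as np

def encode(A, B):
    Ac = np.vstack([A, A.sum(axis=0)])                 # column-checksum A: (n+1) x n
    Br = np.hstack([B, B.sum(axis=1, keepdims=True)])  # row-checksum B: n x (n+1)
    return Ac, Br

def check(Cf, tol=1e-8):
    C = Cf[:-1, :-1]                                   # the actual product block
    row_ok = np.allclose(Cf[-1, :-1], C.sum(axis=0), atol=tol)
    col_ok = np.allclose(Cf[:-1, -1], C.sum(axis=1), atol=tol)
    return row_ok and col_ok

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)
Ac, Br = encode(A, B)
Cf = Ac @ Br            # the (n+1) x (n+1) product carries its own checksums
assert check(Cf)
Cf[1, 2] += 1.0         # inject a soft error (stand-in for a bit flip)
assert not check(Cf)    # the checksums expose the fault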
We demonstrated that computing over encoded data can serve a dual purpose: mitigating faults while improving resource utilization, reducing computation costs, communication costs, and memory footprint. We generalized this approach and applied it to several algorithms and implementations.
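The following toy sketch (our illustration; the block layout and worker names are assumptions) shows how encoded data helps in a distributed matrix multiplication: a single parity block lets any two of three worker results reconstruct the product, so one failed or straggling worker is tolerated without recomputation.

import numpy as np

n = 4
A, B = np.random.rand(2 * n, n), np.random.rand(n, n)
A1, A2 = A[:n], A[n:]          # split A row-wise into two blocks

results = {
    "w1": A1 @ B,
    "w2": A2 @ B,
    "w3": (A1 + A2) @ B,       # parity task over the encoded input
}

# Suppose worker w2 straggles or fails: recover its block from the others.
C1 = results["w1"]
C2 = results["w3"] - results["w1"]
C = np.vstack([C1, C2])
assert np.allclose(C, A @ B)   # full product recovered from two of three results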
Our algorithmic solutions provide fault tolerance together with high performance that no previous state-of-the-art solution can guarantee. Furthermore, the encoded computation yields the fastest matrix multiplication implementations: for example, our solutions significantly outperform Intel's dgemm, the state of the art for CPUs.