CORDIS - EU research results

Robust and Energy-Efficient Numerical Solvers Towards Reliable and Sustainable Scientific Computations

Periodic Reporting for period 1 - Robust (Robust and Energy-Efficient Numerical Solvers Towards Reliable and Sustainable Scientific Computations)

Reporting period: 2019-09-01 to 2021-08-31

In High-Performance Computing (HPC), the top-performing systems are evaluated every six months in terms of sustained floating-point operations per second (flops); every 2-4 years, the system at the top is replaced by another, often built on newer and better-performing hardware. This competition forms the TOP500 list. All of these systems consume enormous amounts of power, which raises the obvious question of power and energy efficiency in HPC. Power and energy efficiency, together with time-to-solution, are the sustainability aspects of HPC. Note that there is also the Green500 list, which ranks systems by their ratio of flops per watt. While some efforts focus on hardware, e.g. the European Processor Initiative (EPI), others target applications and algorithms to make them more energy efficient.

Computations in parallel environments, such as the emerging Exascale systems, are usually orchestrated by complex runtimes that employ various strategies to uniformly and efficiently distribute computations and data. However, these strategies, while pursuing excellent performance and scalability, may also impair the numerical reliability (accuracy and reproducibility) of the final results, due to dynamic and thus non-deterministic execution combined with the non-associativity of floating-point operations.
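The non-associativity issue can be seen in a few lines of Python (whose floats are IEEE 754 binary64 doubles); the values below are chosen purely for illustration:

```python
# Floating-point addition is not associative: regrouping the same three
# addends changes the rounded result, so a parallel reduction whose
# grouping depends on a non-deterministic schedule is not reproducible.
a, b, c = 1e16, 1.0, 1.0

left = (a + b) + c   # 1.0 is absorbed twice: 1e16 + 1.0 rounds back to 1e16
right = a + (b + c)  # 2.0 survives: 1e16 + 2.0 is exactly representable

print(left == right)  # False: the two groupings disagree
```

At the magnitude of 1e16 the spacing between consecutive doubles is 2.0, so adding 1.0 is lost to rounding while adding 2.0 is exact; a runtime that regroups the reduction differently on each run returns a different answer.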

In this project, we primarily focus on fundamental algorithmic solutions, which often lie at the heart of real-world applications, and plan to collaborate with hardware experts on a possible joint undertaking. In particular, we aim to make algorithms numerically reliable, meaning that users can always rely on the output for different problems and for different configurations of the same or another system. Numerical reliability covers accuracy (the quality of results) and reproducibility (the ability to obtain the same results on repeated executions). Additionally, scientific computations frequently rely on a single working precision for problems of varying complexity, which leads either to significant underutilization of the floating-point representation or to a lack of accuracy. We aim to develop numerically reliable (also called robust) algorithms that can be adjusted to the actual working precision, pursuing the goal of sustainable computations.
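A toy example makes the working-precision point concrete: whether a computation makes progress at all can depend on the precision chosen. The helper below (an illustration, not project code) rounds a binary64 value to binary32 using only the standard library:

```python
import struct

def to_binary32(x):
    # Round a Python float (IEEE 754 binary64) to the nearest binary32
    # value by packing and unpacking it as a 4-byte float.
    return struct.unpack('f', struct.pack('f', x))[0]

# An update of 1e-8 relative to 1.0 is representable in binary64
# (machine epsilon ~2.2e-16) but vanishes entirely in binary32
# (machine epsilon ~1.2e-7): the same algorithm stagnates in one
# working precision and converges in the other.
print(1.0 + 1e-8 > 1.0)                # True  in binary64
print(to_binary32(1.0 + 1e-8) > 1.0)   # False in binary32
```

Conversely, a problem that is fully resolved at binary32 accuracy wastes more than half of the binary64 significand, which is the underutilization mentioned above.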
We have developed a library for fast, accurate, and reproducible fundamental linear algebra operations that frequently appear as the underlying layer of scientific computations. The library is called ExBLAS, for Exact BLAS (Basic Linear Algebra Subprograms). In Robust, our focus is on enhancing and expanding the ExBLAS library, but even more on developing robust iterative linear solvers, which are used throughout scientific computing and consume a significant share of its execution time. Our first target was the Preconditioned Conjugate Gradient (PCG) method for solving symmetric sparse systems, such as those arising from meshing the surfaces of planes and cars. This time we approached the problem from a different perspective: instead of making all floating-point operations strictly reproducible, which can become expensive, we first identified the parts of the PCG solver that are prone to non-reproducibility and loss of accuracy. While working on this task, we gained precise knowledge of iterative solvers, architectures, and compiler optimizations (e.g. the silent replacement of instructions in favor of the fused multiply-add, fma, instruction). We fix these issues by expanding and modifying the ExBLAS approach to tackle parallel computations of the residual (a dot product of two vectors), by restricting the compiler to explicit, user-defined usage of the fma instruction, etc. However, after analyzing the precision actually required by the PCG solver, we realized that our approach could be enhanced even further. Thus, we use a less expensive part of ExBLAS (a short vector of floating-point numbers that stores both the result and the error) together with an additional optimization technique.
Remarkably, a vector with only three elements is enough to obtain accurate and reproducible results in practice, because the very small remaining error propagates to the tail of this vector, although this is difficult to prove with rigorous and often pessimistic numerical analysis tools. This approach enhances the performance of the robust PCG even further. Furthermore, we show that both approaches deliver identical intermediate and final results, in terms of residuals and numbers of iterations, for various numbers of resources, and even achieve cross-platform reproducibility, which is rare. We conduct our tests on synthetic matrices as well as on real cases from the SuiteSparse matrix collection. We extend this research to the Krylov subspace method called Preconditioned BiCGStab, which handles nonsymmetric matrices, as well as to its pipelined version, which is better suited for runs at large scale. With this, we provide reproducible versions of iterative solvers capable of working on different types of matrices.
To our knowledge, there is no prior work on accurate and reproducible iterative solvers on multi-node clusters: existing works either focus on GPUs (Mukunoki et al., our collaborators, also cover Conjugate Gradient) or target primarily BLAS operations. Moreover, we show that both the ExBLAS-like and the novel lightweight strategies deliver identical intermediate and final results, in terms of residuals and numbers of iterations, on various numbers of cores and nodes. These implementations also achieve cross-platform reproducibility (tested on three different clusters), which is rare. Furthermore, we use two different parallel programming models: one is based on a single programming model, the Message Passing Interface (MPI); the other follows a hybrid approach, mixing MPI with OpenMP tasks for better performance both inside many-core nodes (OpenMP) and across nodes (MPI). Notably, these two robust implementations deliver identical results at scale with an overhead within 40%. Finally, even modern libraries still rely on sequential execution for the validation and verification of their results; here we propose parallel solutions that scale with the number of hardware resources and run much faster than sequential validation.
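Why the distribution of work matters can be demonstrated by mimicking a parallel reduction: each "worker" sums a slice of the data, and the partial sums are then combined. With plain floating-point arithmetic the result depends on how many slices there are; with an exact accumulator it does not. The sketch below uses Python's Fraction as a pedagogical stand-in for exact (superaccumulator-style) accumulation; it illustrates the reproducibility property, not the ExBLAS mechanism:

```python
from fractions import Fraction

def chunked_sum(data, nchunks, exact=False):
    # Mimic a parallel reduction: split data into nchunks slices, sum each
    # slice (one "worker" each), then combine the partial sums.
    step = (len(data) + nchunks - 1) // nchunks
    chunks = [data[i:i + step] for i in range(0, len(data), step)]
    if exact:
        # Exact partial sums (stand-in for a superaccumulator): every
        # partitioning yields the same correctly rounded final result.
        total = sum((sum(map(Fraction, c)) for c in chunks), Fraction(0))
        return float(total)
    # Naive partial sums: the rounding committed depends on the split.
    return sum(sum(c) for c in chunks)

data = [1e16, 1.0, -1e16, 1.0]
print(chunked_sum(data, 1))              # 1.0 with one "worker"
print(chunked_sum(data, 2))              # 0.0 with two -- not reproducible
print(chunked_sum(data, 1, exact=True))  # 2.0, independent of the split
print(chunked_sum(data, 2, exact=True))  # 2.0
```

A solver whose residual is computed this way returns bit-wise different convergence histories when the node count changes; exact accumulation removes that dependence, which is what makes results comparable across runs, resource counts, and platforms.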

No website has been developed for the project; please refer to my personal web page.
[Figure: Preconditioned Conjugate Gradient solver with dependencies among kernels]