Project description
Safeguarding supercomputers against faults
Scientific, engineering, and industrial communities rely heavily on supercomputers and their ability to operate efficiently. However, because of their greater computing power and larger memory, next-generation (exascale) supercomputers are expected to experience at least two faults per minute. It is therefore essential to find simple and effective solutions for improving fault tolerance that do not require a high level of expertise. The EU-funded FTHPC project aims to solve the fault-tolerance problem by exploiting recent advances in error-correcting codes and short probabilistically checkable proofs. Success in this endeavour will eliminate the need for fault-tolerance expertise and make exascale high-performance computing accessible to everyone working in algorithm design and programming.
Objective
Supercomputers are strategically crucial for facilitating advances in science and technology: in climate change research, accelerated genome sequencing towards cancer treatments, cutting-edge physics, devising innovative engineering solutions, and many other compute-intensive problems. However, the future of supercomputing depends on our ability to cope with the ever-increasing rate of faults (bit flips and component failures), resulting from the steadily increasing machine size and decreasing operating voltage. Indeed, hardware trends predict at least two faults per minute for next-generation (exascale) supercomputers.
The challenge of ensuring fault tolerance for high-performance computing is not new, and has been the focus of extensive research for over two decades. However, most solutions are either (i) general-purpose, requiring little to no algorithmic effort, but severely degrading performance (e.g. checkpoint-restart), or (ii) tailored to specific applications and very efficient, but requiring high expertise and significantly increasing programmers' workload. We seek the best of both worlds: high performance and general-purpose fault resilience.
Efficient general-purpose solutions (e.g. via error-correcting codes) revolutionized memory and communication devices more than two decades ago, enabling programmers to effectively disregard the very likely memory and communication errors. The time has come for a similar paradigm shift in the computing regimen. I argue that exciting recent advances in error-correcting codes, and in short probabilistically checkable proofs, make this goal feasible. Success along these lines will eliminate the bottleneck of required fault-tolerance expertise, and open exascale computing to all algorithm designers and programmers, for the benefit of the scientific, engineering, and industrial communities.
Field of science
- medical and health sciences > clinical medicine > oncology
- engineering and technology > electrical engineering, electronic engineering, information engineering > electronic engineering > computer hardware > supercomputers
- natural sciences > earth and related environmental sciences > atmospheric sciences > climatology > climatic changes
- natural sciences > biological sciences > genetics > genomes
Funding scheme
ERC-COG - Consolidator Grant
Host institution
91904 Jerusalem
Israel