Skip to main content
Go to the home page of the European Commission (opens in new window)
English English
CORDIS - EU research results
CORDIS

Fault Tolerant High Performance Computing

Project description

Insuring supercomputers against faults

Scientific, engineering and industrial communities rely heavily on supercomputers and their ability to perform efficiently. With their increased processing power and memory, next-generation (exascale) supercomputers are predicted to encounter at least two faults per minute, so it is imperative to find simple and effective solutions to enhance fault tolerance that won’t require high levels of expertise. The EU-funded FTHPC project aims to resolve the issue of fault tolerance by using recent advances in error correcting codes and short probabilistically checkable proofs. The success of this endeavour will eliminate the need for fault-tolerance expertise and make exascale computing accessible to all algorithm designers and programmers.

Objective

Supercomputers are strategically crucial for facilitating advances in science and technology: in climate change research, accelerated genome sequencing towards cancer treatments, cutting edge physics, devising engineering innovative solutions, and many other compute intensive problems. However, the future of super-computing depends on our ability to cope with the ever increasing rate of faults (bit flips and component failure), resulting from the steadily increasing machine size and decreasing operating voltage. Indeed, hardware trends predict at least two faults per minute for next generation (exascale) supercomputers.

The challenge of ascertaining fault tolerance for high-performance computing is not new, and has been the focus of extensive research for over two decades. However, most solutions are either (i) general purpose, requiring little to no algorithmic effort, but severely degrading performance (e.g. checkpoint-restart), or (ii) tailored to specific applications and very efficient, but requiring high expertise and significantly increasing programmers' workload. We seek the best of both worlds: high performance and general purpose fault resilience.

Efficient general purpose solutions (e.g. via error correcting codes) have revolutionized memory and communication devices over two decades ago, enabling programmers to effectively disregard the very
likely memory and communication errors. The time has come for a similar paradigm shift in the computing regimen. I argue that exciting recent advances in error correcting codes, and in short probabilistically checkable proofs, make this goal feasible. Success along these lines will eliminate the bottleneck of required fault-tolerance expertise, and open exascale computing to all algorithm designers and programmers, for the benefit of the scientific, engineering, and industrial communities.

Fields of science (EuroSciVoc)

CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. See: The European Science Vocabulary.

You need to log in or register to use this function

Keywords

Project’s keywords as indicated by the project coordinator. Not to be confused with the EuroSciVoc taxonomy (Fields of science)

Programme(s)

Multi-annual funding programmes that define the EU’s priorities for research and innovation.

Topic(s)

Calls for proposals are divided into topics. A topic defines a specific subject or area for which applicants can submit proposals. The description of a topic comprises its specific scope and the expected impact of the funded project.

Funding Scheme

Funding scheme (or “Type of Action”) inside a programme with common features. It specifies: the scope of what is funded; the reimbursement rate; specific evaluation criteria to qualify for funding; and the use of simplified forms of costs like lump sums.

ERC-COG - Consolidator Grant

See all projects funded under this funding scheme

Call for proposal

Procedure for inviting applicants to submit project proposals, with the aim of receiving EU funding.

(opens in new window) ERC-2018-COG

See all projects funded under this call

Host institution

THE HEBREW UNIVERSITY OF JERUSALEM
Net EU contribution

Net EU financial contribution. The sum of money that the participant receives, deducted by the EU contribution to its linked third party. It considers the distribution of the EU financial contribution between direct beneficiaries of the project and other types of participants, like third-party participants.

€ 1 824 467,00
Address
EDMOND J SAFRA CAMPUS GIVAT RAM
91904 JERUSALEM
Israel

See on map

Activity type
Higher or Secondary Education Establishments
Links
Total cost

The total costs incurred by this organisation to participate in the project, including direct and indirect costs. This amount is a subset of the overall project budget.

€ 1 824 467,00

Beneficiaries (1)

My booklet 0 0