The objective of FTMPS is to develop techniques and system software that allow massively parallel computers to tolerate component failures, so that application code can run for extremely long periods in settings where a real-time response is not required.
A transputer-based system featuring redundant processor nodes, a fault-tolerant communications network architecture and an independent network of control processors provides the environment for the work.
The project examines:
concurrent failure detection at both node and system level
checkpointing and restart of applications
post-failure recovery behaviour
quantitative failure modelling.