
Predictably Dependable Computing Systems

Objective

The long-term objective of PDCS was to produce a design support environment well populated with tools and ready-made system components that fully supports the notion of the predictably dependable design of large real-time fault-tolerant distributed systems. The Action aimed to develop effective techniques for establishing realistic dependability requirements; produce quantitative methods for measuring and predicting the dependability of complex software/hardware systems; and incorporate such methods more fully into the design process, making it much more controlled and capable of allowing design decisions to be based on meaningful analyses of risks and quantified likely benefits.
The process of designing and constructing adequately dependable computing systems was investigated with the aim of making it much more predictable and cost-effective than at present. Particular attention was paid to the problems of producing dependable distributed real-time systems in which dependability is achieved by incorporating mechanisms for fault tolerance. An important goal of the research was to facilitate the increased use of quantitative assessments of system dependability as a prerequisite to turning the construction of large computer-based systems into a true engineering discipline. To this end, great stress was placed on the necessity of taking a systems engineering approach.

The problem of the use of fault injection for testing algorithms and mechanisms implementing fault tolerance with respect to specific inputs is addressed. A global characterization of the fault injection attributes (the faults, the activation, the readouts and the measures) is presented with respect to the various levels of abstraction of representation of the target system used in the development process (axiomatic, empirical and physical models) and the validation objectives (fault removal and fault forecasting). The fault forecasting issues are addressed specifically and an evaluation method is proposed that implements the link between the analytical modelling approaches (Monte Carlo simulations, closed-form expressions, Markov chains) used for the representation of the fault occurrence process and the experimental fault injection approaches (fault simulation and physical injection) characterizing the error processing and fault treatment provided by the fault tolerance algorithms and mechanisms. An approach for the characterization of the faults to be injected for uncovering design/implementation faults in the fault tolerance algorithms and mechanisms has been investigated.
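The link described above, between an experimental coverage measure and an analytical model of fault occurrence, can be illustrated with a minimal sketch. It is not the PDCS evaluation method itself: the duplex architecture, the absence of repair, the failure rate and the injection outcomes are all assumptions chosen for illustration, and the names `estimate_coverage` and `duplex_reliability` are hypothetical. Coverage is estimated as the fraction of injected faults that were tolerated, then used as a parameter in the closed-form solution of a three-state Markov chain.

```python
import math


def estimate_coverage(outcomes):
    """Estimate coverage as the fraction of injected faults that were tolerated.

    `outcomes` is a sequence of booleans, one per injection experiment
    (True = the fault tolerance mechanism handled the fault correctly).
    """
    if not outcomes:
        raise ValueError("need at least one injection experiment")
    return sum(1 for tolerated in outcomes if tolerated) / len(outcomes)


def duplex_reliability(failure_rate, coverage, t):
    """Reliability at time t of a two-unit system without repair (assumed model).

    Markov chain used here:
      both units up -> one unit up  at rate 2 * failure_rate * coverage
      both units up -> failed       at rate 2 * failure_rate * (1 - coverage)
      one unit up   -> failed       at rate failure_rate
    """
    p_both_up = math.exp(-2.0 * failure_rate * t)
    p_one_up = 2.0 * coverage * (math.exp(-failure_rate * t) - p_both_up)
    return p_both_up + p_one_up


# Hypothetical campaign: 100 injected faults, 95 tolerated.
outcomes = [True] * 95 + [False] * 5
coverage = estimate_coverage(outcomes)
reliability = duplex_reliability(failure_rate=1e-4, coverage=coverage, t=1000.0)
print(f"coverage = {coverage:.2f}, R(1000 h) = {reliability:.4f}")
```

With perfect coverage the expression reduces to the usual reliability of two units in parallel, and with zero coverage to that of two units in series, which gives a quick sanity check on the model.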

Experience of applying reliability models has shown that the relative predictive performance of the models depends entirely on the context. No single model performs well over all data sets, and for some data sets all models are in error. In such cases, two techniques have been shown to improve predictive accuracy: recalibrating the raw model predictions, and using the results of trend tests before applying the models. The two techniques may be used separately or in combination. A number of reliability models, together with the recalibration technique, were applied to failure data collected from a workstation, and the performance of the resulting prediction systems was assessed.
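As an illustration of a trend test applied before model fitting, the sketch below computes the Laplace factor for a set of failure times. The text above does not say which trend test was used, so the choice of the Laplace test is an assumption, and the function name and sample data are hypothetical. Under the hypothesis of a constant failure rate, failure times are uniformly spread over the observation period and the factor is approximately standard normal; clearly negative values suggest reliability growth, clearly positive values suggest reliability decrease.

```python
import math


def laplace_factor(failure_times, observation_end):
    """Laplace trend factor for failure times observed on (0, observation_end].

    Compares the mean failure time with the midpoint of the observation
    period, scaled by the standard deviation expected under a constant
    failure rate (uniformly distributed failure times).
    """
    n = len(failure_times)
    if n == 0:
        raise ValueError("need at least one failure time")
    if any(t < 0 or t > observation_end for t in failure_times):
        raise ValueError("failure times must lie within the observation period")
    mean_time = sum(failure_times) / n
    return (mean_time - observation_end / 2.0) / (
        observation_end * math.sqrt(1.0 / (12.0 * n))
    )


# Hypothetical data: failures concentrated early in a 100-hour observation period.
early_failures = [2.0, 5.0, 9.0, 15.0, 30.0]
print(laplace_factor(early_failures, observation_end=100.0))
```

A data set showing no clear growth trend is one for which reliability growth models would be expected to predict poorly, which is why such a check is useful before the models are applied.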
APPROACH AND METHODS
Work was structured into three interrelated tasks:
-Dependability Concepts and Terminology, investigating the conceptual foundations of the provision of dependability, particularly of predictable dependability, with the aim of developing a systematic, unified approach to the design of fault-tolerant systems in order to facilitate the assessment of their dependability characteristics.
-Specification and Design for Dependability, concerned with gaining an understanding of the principles involved in constructing a development environment for PDCS and an exploration of a number of issues concerned with system specification and design of particular importance for dependability (largely ignored by the mainstream of formal specification techniques).
-Dependability Prediction and Evaluation, concerned with basic techniques for measurement and prediction, with particular attention to their application to fault-tolerant systems and software, and to issues of validation.
PROGRESS AND RESULTS
-Dependability Concepts and Terminology: PDCS has been instrumental in developing the concepts and terminology of dependability through involvement in IFIP WG10.4 and other forums.
-Specification and Design for Dependability: Work on possible means of modelling the effect of the operational environment on the particular dependability attribute of reliability focused on the choice of probabilistic model. Work on specification and design for timeliness includes: i. comparison of event-triggered and time-triggered real-time systems, ii. scheduling in real-time systems, iii. testing real-time applications software. The problem of non-functional requirements (NFRs) was investigated to provide some taxonomic principles for their classification. In fault tolerance, the emphasis was on tolerating software faults, including i. a systematic survey of the techniques for software fault tolerance, ii. design schemes and notations for structuring software (or software-implemented) fault tolerance, iii. a technique for producing fault-tolerant software from algebraic specifications, iv. the application of fault tolerance methods to intentional security violations, v. systematic engineering methods for guiding decisions in the design of fault-tolerant systems. Sub-problems addressed include: i. achieving a high integrity development process, with support from an SEE, ii. techniques for developing highly reliable parallel programs for distributed systems, iii. the design and development of real-time systems, iv. environments for the production of secure application systems.
-Dependability Prediction and Evaluation: Work focused on: i. a generalisation of the classical, hardware-oriented, reliability theory to cover software, and of the reliability growth models to cover hardware, ii. an extension of classical multiprocessor models to cover multiple simultaneous breakdowns and/or repairs, iii. modelling of the Recovery Blocks, N-Version Programming and N Self-Checking Programming methods for software fault tolerance, iv. using testing to estimate parameters for recovery block models, v. extending reliability growth modelling from black box to multi-component systems, vi. a general technique for recalibrating software reliability models which improves the accuracy of reliability predictions under a wide class of circumstances, vii. an investigation showing how reliability trend analysis can help the designer in controlling the progress of the development activities, viii. a study of the basic issues of measurement of software process and product. On the analysis of timeliness, work has centred on: i. the prediction of the response time of real-time tasks, ii. dependable communication protocols. Work on statistical assessment investigated: i. theoretical bases to quantify the quality of statistical testing with respect to a given criterion (or set of criteria), including experiments using nuclear reactor control software, ii. methods of fault injection for fault tolerance validation. In the area of security evaluation, the issues of the theoretical bases of possible measures of operational security and the practical evaluation of security products have been pursued. Regarding ultra-high dependability, we have investigated: i. limits of the various techniques available for validation (reliability growth models, testing with stable reliability, structural dependability modelling and informal arguments based on good engineering practice), ii. the possibility of augmenting these techniques by explicitly taking account of the process in the quantitative assessment, iii. advantages and limits of formal approaches to software development for achieving ultra-high dependability of safety-critical computing systems, iv. requirements analysis for safety-critical systems.
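The Recovery Blocks scheme named in the results above can be sketched in a few lines. This is a minimal illustration of the general technique, not code from the project: the `recovery_block` function, the `RecoveryBlockFailure` exception and the sorting example are all hypothetical. Each alternate runs on a copy of the state (the checkpoint), an acceptance test judges the outcome, and a rejected or crashed alternate is rolled back simply by discarding its copy before the next alternate is tried.

```python
import copy


class RecoveryBlockFailure(Exception):
    """Raised when no alternate passes the acceptance test."""


def recovery_block(state, alternates, acceptance_test):
    """Try each alternate in order on a fresh copy of `state`.

    Returns the first resulting state that passes `acceptance_test`.
    The caller's `state` is never modified, which provides the rollback.
    """
    for alternate in alternates:
        trial = copy.deepcopy(state)  # checkpoint: work on a private copy
        try:
            alternate(trial)
        except Exception:
            continue  # an exception counts as a detected error; try the next alternate
        if acceptance_test(trial):
            return trial
    raise RecoveryBlockFailure("all alternates failed the acceptance test")


def faulty_sort(state):
    # Deliberately faulty primary: sorts but drops the last element.
    state["data"] = sorted(state["data"])[:-1]


def simple_sort(state):
    # Simpler alternate used as a fallback.
    state["data"] = sorted(state["data"])


def acceptance(state):
    # Accept only results that keep all three items and are in order.
    data = state["data"]
    return len(data) == 3 and all(a <= b for a, b in zip(data, data[1:]))


original = {"data": [3, 1, 2]}
result = recovery_block(original, [faulty_sort, simple_sort], acceptance)
print(result["data"], original["data"])
```

The effectiveness of the scheme depends on the acceptance test and on how independently the alternates fail, which is the kind of parameter the modelling and testing work listed above sets out to estimate.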
POTENTIAL
All the results obtained are freely available for use by industry. Many partners are involved in the provision of short industrial courses, facilitating industrial take-up. A number of direct links have been set up with various applied industrial research projects, including ESPRIT's FASST (5212), which plans to use some of the methods for modelling and evaluation of dependability that have been developed under PDCS to assist the design of the FASST prototype multiprocessor. There is now an active programme of exploitation of SoRel, the tool for software reliability analysis and prediction developed by LAAS; it will be distributed by CEP Systèmes.

Coordinator

UNIVERSITY OF NEWCASTLE UPON TYNE
Address

NE1 7RU Newcastle Upon Tyne
United Kingdom

Participants (7)

Centre National de la Recherche Scientifique (CNRS)
France
Address
7 Avenue du Colonel Roche
31077 Toulouse
City University
United Kingdom
Address
Northampton Square
EC1V 0HB London
NATIONAL RESEARCH COUNCIL OF ITALY
Italy
Address
Via G. Moruzzi 1
56124 Pisa
TECHNISCHE UNIVERSITAT WIEN-IAIS
Austria
Address
Treitlstrasse, 3
1040 Wien
UNIVERSITÄT KARLSRUHE
Germany
Address
Haid-und-Neu-Straße 7
76131 Karlsruhe
University of York
United Kingdom
Address
Heslington
YO1 5DD York
Université de Paris XI (Université Paris-Sud)
France
Address
Avenue Georges Clémenceau
91405 Orsay