
Self-assessment Oracles for Anticipatory Testing

Periodic Reporting for period 3 - PRECRIME (Self-assessment Oracles for Anticipatory Testing)

Reporting period: 2022-01-01 to 2023-06-30

Our society increasingly relies on software that uses artificial intelligence for its operation. This is the case whenever software must process images, natural language, speech and other complex inputs. Software based on artificial intelligence (e.g. deep neural networks) is increasingly used in safety-critical systems (e.g. self-driving cars, which use camera images to control the steering wheel), health-care applications (e.g. recommender systems used by medical doctors) or business-critical domains (e.g. financial data analytics). Hence, its dependability and reliability are becoming a major concern for society.

The problem tackled by the ERC project Precrime is ensuring the dependability and quality of software systems that may face unexpected execution conditions. This is definitely the case for artificial-intelligence-based software, as the environment where such software operates features a huge range of variable conditions (consider, for instance, the wide range of driving conditions a self-driving car may encounter). The main objectives of Precrime are (1) the creation of a self-oracle and (2) the automation of testing under unexpected execution contexts. Specifically, Precrime is developing a self-oracle that can recognize unexpected execution conditions at run time and can heal the system from misbehaviors when such conditions occur. Precrime is also developing automated test generation techniques to increase the robustness of systems that may face unexpected execution conditions.
The work conducted in the first half of the project can be organized into three major research areas:

Deep learning faults and mutation: to investigate the specific nature of faults that affect deep neural networks, we have performed a qualitative analysis of software forum discussions and code commit messages, and we have conducted semi-structured interviews. The acquired knowledge was organized into a taxonomy of real deep learning faults, which was presented at the conference ICSE 2020. We then developed a deep learning mutation tool, called DeepCrime, which can inject artificial faults into a deep learning component, mimicking the real faults described in our taxonomy. By simulating the occurrence of real faults in a deep learning component, we can assess the thoroughness of testing: new test cases should be created until every artificially injected fault is exposed by at least one test case. A test set augmented in this way, guided by the faults injected by DeepCrime, ensures a higher degree of robustness of the system under test. DeepCrime was presented at the conference ISSTA 2021.
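
To make the mutation idea concrete, the sketch below injects two illustrative faults inspired by our taxonomy (a far-too-large learning rate and a missing activation function) into a small Keras classifier and checks whether the test data exposes them. The operator names, the toy data and the simple single-run accuracy-drop criterion are our own simplifications for illustration; they are not the actual DeepCrime operators, killing criterion or API.

# Illustrative sketch only: conveys the idea of mutating a deep learning
# training program, not the actual DeepCrime implementation.
import numpy as np
import tensorflow as tf

def build_model(learning_rate=1e-3, hidden_activation="relu"):
    # The two parameters are the "fault injection points" of this toy example.
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(2,)),
        tf.keras.layers.Dense(16, activation=hidden_activation),
        tf.keras.layers.Dense(1, activation="sigmoid"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model

# Hypothetical mutation operators inspired by the fault taxonomy (names are ours).
MUTATIONS = {
    "wrong_learning_rate": dict(learning_rate=1.0),      # far too large
    "missing_activation": dict(hidden_activation=None),  # linear hidden layer
}

rng = np.random.default_rng(0)
x = rng.normal(size=(600, 2)).astype("float32")
y = (x[:, 0] * x[:, 1] > 0).astype("float32")  # simple non-linear task
x_train, y_train, x_test, y_test = x[:400], y[:400], x[400:], y[400:]

def trained_accuracy(**mutated_params):
    model = build_model(**mutated_params)
    model.fit(x_train, y_train, epochs=50, verbose=0)
    return model.evaluate(x_test, y_test, verbose=0)[1]

original_acc = trained_accuracy()
for name, params in MUTATIONS.items():
    mutant_acc = trained_accuracy(**params)
    # A mutant counts as "killed" here if the test set reveals a clear accuracy
    # drop in a single run; DeepCrime's killing criterion is more sophisticated.
    print(f"{name}: original={original_acc:.2f} mutant={mutant_acc:.2f} "
          f"killed={original_acc - mutant_acc > 0.1}")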

Frontier of behaviors: deep learning components produce unreliable outputs when operating on inputs that deviate from those used to train them. For instance, a self-driving car trained to operate exclusively in sunny conditions might misbehave when it is raining or snowing. Our approach to assessing the reliability of a deep learning component automatically finds its frontier of behaviors, consisting of pairs of nearby conditions such that in one the component behaves properly while in the other it fails. In the self-driving car example, this could be the transition from sunny to rainy conditions. Our approach is implemented in a tool, called DeepJanus, that can automatically find and report the frontier of behaviors. The output of DeepJanus helps developers identify the conditions under which misbehaviors are possible. If such conditions belong to the input validity domain, they can actually occur in the field, which indicates the need for actions to improve the quality of the system. Conversely, when misbehaviors happen only in unrealistic conditions, our method provides confidence in the system's dependability. DeepJanus was presented at the conference ESEC/FSE 2020.
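
The sketch below conveys the frontier idea in its simplest form: given a nominal input the component handles correctly and a perturbation that eventually makes it fail, a bisection over the perturbation intensity yields two nearby inputs, one on each side of the frontier. The one-dimensional search, the toy "component" and the function names are our own illustrative assumptions; DeepJanus itself performs a much richer search over realistic input representations.

# Simplified illustration of the frontier-of-behaviors idea, not DeepJanus itself.
import numpy as np

def find_frontier_pair(predicts_ok, nominal, direction, tol=1e-3):
    # Bisect the intensity t of the perturbation nominal + t * direction.
    # predicts_ok(x) -> True if the component behaves correctly on input x.
    lo, hi = 0.0, 1.0
    assert predicts_ok(nominal + lo * direction)
    assert not predicts_ok(nominal + hi * direction)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predicts_ok(nominal + mid * direction):
            lo = mid
        else:
            hi = mid
    # Two nearby inputs: the first behaves correctly, the second misbehaves.
    return nominal + lo * direction, nominal + hi * direction

# Toy stand-in for the component under test: it behaves correctly only while the
# input stays inside the unit disc (think: in-distribution driving conditions).
behaves = lambda x: float(np.linalg.norm(x)) < 1.0

good, bad = find_frontier_pair(behaves, nominal=np.array([0.2, 0.1]),
                               direction=np.array([2.0, 0.5]))
print("still behaves at:", good)
print("misbehaves at:", bad)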

Misbehavior prediction: this line of research realizes Precrime's vision of a self-oracle that can determine whether a system based on artificial intelligence components is facing unexpected execution conditions under which it should be safely disengaged to avoid damage. For instance, a self-driving car facing an unexpected driving scenario should activate a safe disengagement procedure. We took both a black-box and a white-box approach to the self-oracle problem. In the black-box approach we consider only the input to the deep learning component and assess its proximity to the training data by means of autoencoders. When the input deviates from the training data, the self-oracle activates safe disengagement. This approach was presented at the conference ICSE 2020. In the white-box approach, the internals of the deep learning component are inspected to obtain measurements of uncertainty. When the component is highly uncertain about its output, a safe disengagement procedure is again activated. The white-box techniques to measure uncertainty are implemented within UncertaintyWizard, a tool that we presented at the conference ICST 2021.
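
The sketch below shows the black-box idea in miniature: an autoencoder trained on nominal data reconstructs in-distribution inputs well, so an unusually high reconstruction error signals an unexpected condition and triggers safe disengagement. The toy data, the small linear autoencoder and the percentile threshold are our own simplifying assumptions; the actual self-oracle operates on driving images and uses a more principled thresholding of reconstruction errors.

# Minimal black-box self-oracle sketch (our simplification, not the published tool).
import numpy as np
import tensorflow as tf

rng = np.random.default_rng(1)

# Nominal data lying on a low-dimensional manifold (stand-in for in-distribution inputs).
latent = rng.normal(size=(2000, 3)).astype("float32")
mixing = rng.normal(size=(3, 8)).astype("float32")
nominal = latent @ mixing + 0.05 * rng.normal(size=(2000, 8)).astype("float32")

# A small linear autoencoder is enough to learn this toy manifold.
autoencoder = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(3),   # bottleneck
    tf.keras.layers.Dense(8),
])
autoencoder.compile(optimizer="adam", loss="mse")
autoencoder.fit(nominal, nominal, epochs=50, verbose=0)

def reconstruction_error(x):
    return np.mean((autoencoder.predict(x, verbose=0) - x) ** 2, axis=1)

# Flag inputs whose error exceeds the 99th percentile observed on nominal data.
threshold = np.percentile(reconstruction_error(nominal), 99)

def should_disengage(x):
    return reconstruction_error(x) > threshold

unexpected = rng.normal(loc=4.0, size=(5, 8)).astype("float32")
print(should_disengage(nominal[:5]))   # mostly False: close to the training data
print(should_disengage(unexpected))    # mostly True: unexpected conditions
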
Our taxonomy of real faults and our mutation tool DeepCrime represent a major step forward in the comprehension and injection of deep learning faults. Our taxonomy is the first to incorporate the output of interviews with professional deep learning developers, and our tool is the first to inject faults that simulate real ones into a deep learning component. In fact, prior to DeepCrime, deep learning mutation was limited to random manipulations of the parameters of a neural network, with no connection to the effect that a real fault may have on them. In our future work, we plan to automatically generate test inputs that expose the real faults injected by DeepCrime. We expect a test suite augmented with such automatically generated inputs to be much stronger than the original one in assessing the robustness of the deep learning component under test.

State-of-the-art automated input generation for deep learning components does not consider any notion of a frontier of behaviors. As a consequence, the critical inputs generated automatically by existing tools may look unrealistic and far from the nominal execution conditions faced by the component under test. By contrast, our approach finds the frontier inputs that separate the expected behavior from a misbehavior, hence providing developers with a very clear indication and assessment of the region where the system starts to become unreliable. In our future work we will assess the frontier of behaviors of systems that continue to learn after deployment, by means of algorithms such as reinforcement learning. We will also investigate techniques, based on feature maps, to provide developers with explanations about the input features that make a deep learning component misbehave.

With Self-Oracle, we have been the first researchers to experiment with a self-healing solution based on autoencoders on a complex system, such as a self-driving car, that includes deep learning components. Moreover, our publicly available tool UncertaintyWizard has provided the research community with a solid, optimized and well-documented solution for uncertainty estimation. In our future work we will port the Self-Oracle to environments where continuous adaptation to changes is required. We will also investigate novel hybrid black-box and white-box solutions to the self-oracle problem.
Schematic representation of Precrime's Self-Oracle