Self-assessment Oracles for Anticipatory Testing

Periodic Reporting for period 4 - PRECRIME (Self-assessment Oracles for Anticipatory Testing)

Reporting period: 2023-07-01 to 2024-06-30

Our society increasingly relies on software that uses artificial intelligence for its operation. This happens when the software has to process images, natural language, speech and other complex inputs. Software based on artificial intelligence (e.g. deep neural networks) is increasingly used in safety-critical systems (e.g. self-driving cars, which use camera images and lidar to control the steering wheel, throttle and brake), health-care applications (e.g. recommender systems used by medical doctors) and business-critical domains (e.g. financial data analytics). Hence, its dependability and reliability are becoming a major concern for society.

The problem tackled by the ERC project Precrime is ensuring the dependability and quality of software systems that may face unexpected execution conditions. This is certainly the case for software based on artificial intelligence, as the environment in which such software operates features a huge range of variable conditions (consider, for instance, the wide range of driving conditions in which a self-driving car may operate). The main objectives of Precrime are (1) the creation of a self-oracle and (2) the automation of testing under unexpected execution contexts. Specifically, Precrime has developed a self-oracle that can recognize unexpected execution conditions at run time and can heal the system from misbehaviors when such conditions occur. Precrime has also developed automated test generation techniques to increase the robustness of systems that may face unexpected execution scenarios.
The work conducted during the project can be organized into three major research areas:

Deep learning faults and mutation: in order to investigate the specific nature of faults that affect deep neural networks, we have performed a qualitative analysis of software forum discussions and code commit messages, and we have conducted semi-structured interviews. The acquired knowledge was organized into a taxonomy of real deep learning faults (ICSE 2020). We have then developed a deep learning mutation tool, called DeepCrime (ISSTA 2021), which can inject artificial faults into a deep learning component, mimicking the real faults described in our taxonomy. By simulating the occurrence of real faults in a deep learning component, we can assess the thoroughness of testing: new test cases should be created until every artificially injected fault is exposed by at least one test case. The augmented test set obtained by targeting the faults injected by DeepCrime ensures a higher degree of robustness of the system under test.
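
To make the idea concrete, the following minimal Python sketch illustrates hyperparameter-level mutation in the spirit of DeepCrime (it does not use DeepCrime's actual API): a mutant is a model retrained under a faulty configuration that mimics a real fault from our taxonomy, such as a wrong learning rate, and it is considered killed when the test set detects a significant accuracy drop with respect to the original model. The model architecture, the mutation operators and the killing threshold are illustrative assumptions; DeepCrime itself applies a statistical killing criterion over multiple training runs.

import tensorflow as tf

def build_model(learning_rate=1e-3):
    # Small illustrative classifier; stands in for the component under test.
    model = tf.keras.Sequential([
        tf.keras.layers.Flatten(input_shape=(28, 28)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Hypothetical mutation operators inspired by the real-fault taxonomy:
# each entry trains the model under a faulty configuration.
MUTATIONS = {
    "wrong_learning_rate": dict(learning_rate=1.0, epochs=5),
    "too_few_epochs": dict(learning_rate=1e-3, epochs=1),
}

def mutant_is_killed(name, x_train, y_train, x_test, y_test,
                     original_accuracy, threshold=0.05):
    # The mutant is "killed" when the test set exposes a clear accuracy drop.
    cfg = MUTATIONS[name]
    mutant = build_model(cfg["learning_rate"])
    mutant.fit(x_train, y_train, epochs=cfg["epochs"], verbose=0)
    _, accuracy = mutant.evaluate(x_test, y_test, verbose=0)
    return original_accuracy - accuracy > threshold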

Test scenario generation: deep learning components produce unreliable outputs when operating on inputs that deviate from those used to train them. For instance, a self-driving car trained to operate exclusively in sunny conditions might misbehave when it is raining or snowing. Our approach to assessing the reliability of a deep learning component automatically finds its frontier of behaviors, consisting of pairs of nearby conditions such that in one the component behaves properly while in the other it fails. In the self-driving car example, this could be the transition from sunny to rainy conditions. Our approach is implemented in a tool, called DeepJanus (FSE 2020), that can automatically find and report the frontier of behaviors. To support explainability and debugging of the failure scenarios, we developed an approach, called DeepHyperion (ISSTA 2021; TOSEM 2023), which characterizes the failure conditions in the form of a feature map. Developers can use DeepHyperion's feature map to understand the precise combination of features that leads to a misbehavior (e.g. a specific luminosity condition paired with a specific road shape might occasionally result in out-of-bound episodes of a self-driving car).
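
As a simple illustration of the frontier concept (not of DeepJanus's actual search, which is based on evolutionary optimization), the Python sketch below bisects between an input on which the component behaves correctly and a nearby input on which it fails, until the two inputs are close enough to be reported as a frontier pair. The oracle function and the toy inputs are hypothetical placeholders.

import numpy as np

def find_frontier_pair(passing_input, failing_input, predict_ok,
                       tolerance=1e-3):
    # Returns a pair of nearby inputs that straddle the frontier of behaviors:
    # the first is still handled correctly, the second already triggers a failure.
    lo, hi = passing_input, failing_input
    while np.linalg.norm(hi - lo) > tolerance:
        mid = (lo + hi) / 2.0
        if predict_ok(mid):
            lo = mid          # still on the well-behaved side
        else:
            hi = mid          # already misbehaving
    return lo, hi

# Usage sketch: blend between a clear-weather frame and a heavy-rain frame
# (both hypothetical arrays) and bisect on the blend factor.
if __name__ == "__main__":
    sunny = np.zeros((64, 64))                    # stands in for a sunny frame
    rainy = np.ones((64, 64))                     # stands in for a rainy frame
    predict_ok = lambda img: img.mean() < 0.7     # toy stand-in for the oracle
    ok_side, fail_side = find_frontier_pair(sunny, rainy, predict_ok)
    print("frontier reached at blend level", fail_side.mean())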

Misbehavior prediction: this line of research realizes Precrime’s vision of a self-oracle that can determine whether a system based on artificial intelligence components is facing unexpected execution conditions under which it should be safely disengaged to avoid damage. For instance, a self-driving car facing an unexpected driving scenario requires the activation of a safe disengagement procedure. We took both a black-box and a white-box approach to the self-oracle problem. In the black-box approach we consider only the input to the deep learning component and we assess its proximity to the training data by means of autoencoders. When the input deviates from the training data, the self-oracle activates safe disengagement. This approach was presented at the conference ICSE 2020. In the white-box approach, the internals of the deep learning component are inspected to obtain measurements of uncertainty. When the component is highly uncertain about its output, again a safe disengagement procedure is activated. The white-box techniques to measure uncertainty are implemented within UncertaintyWizard, a tool that we presented at the conference ICST 2021.
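
The sketch below gives a minimal, simplified view of the black-box idea (it is not the exact implementation presented at ICSE 2020): an autoencoder trained on nominal data reconstructs in-distribution inputs well, so a reconstruction error above a threshold calibrated on the training data is taken as a signal that the current input is unexpected and that safe disengagement should be triggered. The architecture, the percentile and the flattened input representation are illustrative assumptions.

import numpy as np
import tensorflow as tf

def build_autoencoder(input_dim=784, latent_dim=32):
    # Simple dense autoencoder; real inputs would typically be images.
    inputs = tf.keras.Input(shape=(input_dim,))
    encoded = tf.keras.layers.Dense(latent_dim, activation="relu")(inputs)
    decoded = tf.keras.layers.Dense(input_dim, activation="sigmoid")(encoded)
    autoencoder = tf.keras.Model(inputs, decoded)
    autoencoder.compile(optimizer="adam", loss="mse")
    return autoencoder

def calibrate_threshold(autoencoder, nominal_inputs, percentile=99):
    # Pick a threshold above the reconstruction errors seen on nominal data.
    recon = autoencoder.predict(nominal_inputs, verbose=0)
    errors = np.mean((nominal_inputs - recon) ** 2, axis=1)
    return np.percentile(errors, percentile)

def should_disengage(autoencoder, frame, threshold):
    # Flag the incoming frame as unexpected when it cannot be reconstructed well.
    recon = autoencoder.predict(frame[None, :], verbose=0)[0]
    return np.mean((frame - recon) ** 2) > threshold
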
Our taxonomy of real faults and our mutation tool DeepCrime represent a major step forward in the comprehension and injection of deep learning faults. Our taxonomy is the first to incorporate the output of interviews with professional deep learning developers, and our tool is the first to inject artificial faults that simulate real faults in a deep learning component. In fact, prior to DeepCrime, deep learning mutation was limited to random manipulations of the parameters of a neural network, with no connection to the effect that a real fault may have on them. We have also developed a new test generator, called DeepMetis, to automatically produce test inputs that expose the real faults injected by DeepCrime. A test suite augmented with such automatically generated inputs is much stronger than the original one in assessing the robustness of the deep learning component under test.
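
The adequacy criterion behind this augmentation can be summarized by a mutation score, i.e. the fraction of injected faults exposed by at least one test input. The following toy computation (with hypothetical test and mutant identifiers, not DeepMetis's output format) shows how an augmented suite raises the score.

def mutation_score(kill_matrix, all_mutants):
    # kill_matrix maps each test input to the set of mutants it exposes.
    killed = set().union(*kill_matrix.values()) if kill_matrix else set()
    return len(killed & set(all_mutants)) / len(all_mutants)

# Hypothetical example: two generated inputs expose two otherwise surviving mutants.
original = {"t1": {"m1"}, "t2": {"m2"}}
augmented = {**original, "gen1": {"m3"}, "gen2": {"m4"}}
print(mutation_score(original, ["m1", "m2", "m3", "m4"]))   # 0.5
print(mutation_score(augmented, ["m1", "m2", "m3", "m4"]))  # 1.0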

Automated input generation techniques for deep learning components available in the state of the art do not consider any notion of a frontier of behaviors. As a consequence, the critical inputs generated automatically by existing tools might look unrealistic and far from the nominal execution conditions faced by the component under test. On the contrary, with our approach we can find the frontier inputs that separate the expected behavior from a misbehavior, hence providing developers with a very clear indication and assessment of the region where the system starts to become unreliable. We have also assessed the frontier of behaviors of systems that continue to learn after deployment, by means of algorithms such as reinforcement learning. In addition, we have investigated novel techniques, based on feature maps, to provide developers with explanations of the input features that make a deep learning component misbehave.
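
As an illustration of the feature-map idea (not DeepHyperion's actual illumination-search algorithm), the sketch below bins failing inputs along two hand-chosen, normalized features, such as luminosity and road curvature, so that developers can see which feature combinations concentrate the misbehaviors. The feature extractors and the assumption that feature values lie in [0, 1] are illustrative.

import numpy as np

def feature_map(failing_inputs, feature_a, feature_b, bins=5):
    # feature_a and feature_b each map an input to a scalar in [0, 1]
    # (e.g. normalized luminosity and road curvature).
    grid = np.zeros((bins, bins), dtype=int)
    for failure in failing_inputs:
        i = min(int(feature_a(failure) * bins), bins - 1)
        j = min(int(feature_b(failure) * bins), bins - 1)
        grid[i, j] += 1   # count of misbehaviors per feature cell
    return grid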

With Self-Oracle, we have been the first researchers to experiment with a self-healing solution based on autoencoders and uncertainty quantification on complex systems, such as a self-driving car, that include deep learning components. Moreover, our publicly available tool UncertaintyWizard has provided the research community with a solid, optimized and well-documented solution for uncertainty estimation. We have also ported Self-Oracle to environments where continuous adaptation to changes is required, and we have investigated novel hybrid black-box and white-box solutions to the self-oracle problem.
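
As one concrete example of a white-box uncertainty measure, the sketch below applies Monte Carlo dropout: the model is run several times with dropout kept active at inference time, and the spread of the sampled predictions is used as an uncertainty score. This only illustrates the kind of quantification that UncertaintyWizard packages in an optimized and documented way; it does not reproduce UncertaintyWizard's API, and the disengagement threshold is an arbitrary placeholder.

import numpy as np
import tensorflow as tf

def mc_dropout_predict(model, x, samples=30):
    # The model is assumed to contain Dropout layers; training=True keeps
    # them active so that repeated calls yield different sampled predictions.
    preds = np.stack([model(x, training=True).numpy() for _ in range(samples)])
    return preds.mean(axis=0), preds.std(axis=0)

def should_disengage(model, frame, threshold=0.15, samples=30):
    # Flag the input for safe disengagement when the samples disagree too much.
    _, std = mc_dropout_predict(model, frame[None, ...], samples)
    return float(std.max()) > threshold
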
Schematic representation of Precrime's Self-Oracle