Robust, Explainable Deep Networks in Computer Vision

Periodic Reporting for period 2 - RED (Robust, Explainable Deep Networks in Computer Vision)

Reporting period: 2022-03-01 to 2023-08-31

Deep learning approaches based on so-called deep neural networks (DNNs) have taken the field of computer vision by storm. They have enabled the automatic analysis and understanding of digital images and video by algorithmic means. While the progress in recent years has been astounding, it would be incorrect to believe that the important problems in computer vision have already been solved. Training such models requires large amounts of data, yet the resulting DNNs still have only limited robustness, i.e. they work very well in the scenarios they have been trained on but do not generalize nearly as well to novel, related scenarios. In addition, the majority of deep neural networks in computer vision show deficiencies in terms of explainability. That is, the role of individual network components is often opaque, and most DNNs in vision do not output reliable quantifications of the uncertainty of their predictions. This limits comprehension by potential users, reduces user trust, and restricts the impact of computer vision solutions in critical real-world applications.

In this project, we aim to significantly advance deep neural networks in computer vision toward improved robustness and explainability. To that end, we will investigate structured network architectures, probabilistic methods, explainable AI techniques, and hybrid generative/discriminative models, all with the goal of increasing robustness and improving explainability. This is accompanied by research on how to assess robustness and aspects of explainability via appropriate datasets and metrics. While we aim to develop a toolbox that is as independent of specific image and video analysis tasks as possible, the work program is grounded in concrete vision problems, e.g. scene understanding and motion estimation, to monitor progress. We expect the project to have significant impact in applications of computer vision where robustness is key, data is limited, and user trust is paramount.
First, with regard to robustness, we have developed a variety of deep neural network architectures that combine standard feedforward neural networks with components or insights from traditional computer vision approaches. This has allowed us to devise structured neural architectures that significantly improve robustness in concrete image and scene analysis tasks. For example, we have contributed novel robust deep learning approaches to 3D motion estimation from video, image deblurring, and automatic image captioning. Moreover, combining standard DNNs with traditional components has enabled us to propose new neural architectures that make it possible to trade off computational efficiency against accuracy.

Second, we have advanced the explainability of DNNs in computer vision in several ways. On the one hand, the novel structured neural architectures discussed above improve the inherent explainability of these models. Beyond this, we have developed practical tools for estimating uncertainties in DNNs in highly efficient ways, making it possible both to quantify the uncertainty of a prediction and to better understand the inner workings of existing neural architectures. Additionally, we have developed highly practical approaches for obtaining post-hoc explanations from deep neural networks, which not only help us better understand existing deep neural networks but also allow us to train new networks that suppress undesirable behavior, such as predictions that are not well aligned with human values or human comprehension.
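
To illustrate the general idea of prediction-level uncertainty estimation, the following minimal sketch uses Monte Carlo dropout, a standard technique from the literature rather than the project's specific estimators; the toy model and the names SmallClassifier and mc_dropout_predict are hypothetical.

    # Minimal sketch of prediction uncertainty via Monte Carlo dropout (illustrative
    # only; not the specific estimators developed in the project).
    import torch
    import torch.nn as nn

    class SmallClassifier(nn.Module):
        """Toy image classifier with dropout, standing in for a generic vision DNN."""
        def __init__(self, num_classes: int = 10):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1),
            )
            self.head = nn.Sequential(
                nn.Flatten(), nn.Dropout(p=0.5), nn.Linear(16, num_classes),
            )

        def forward(self, x):
            return self.head(self.features(x))

    @torch.no_grad()
    def mc_dropout_predict(model: nn.Module, x: torch.Tensor, num_samples: int = 20):
        """Run several stochastic forward passes with dropout kept active and
        return the mean class probabilities and their per-class variance."""
        model.train()  # keep dropout enabled at inference time
        probs = torch.stack(
            [torch.softmax(model(x), dim=-1) for _ in range(num_samples)]
        )  # shape: (num_samples, batch, num_classes)
        return probs.mean(dim=0), probs.var(dim=0)

    # Usage: higher variance signals a less reliable prediction.
    model = SmallClassifier()
    images = torch.randn(4, 3, 32, 32)  # dummy batch standing in for real images
    mean_probs, var_probs = mc_dropout_predict(model, images)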

Third, we have worked on comprehensive benchmarking methodologies and created novel datasets, which make it possible to quantitatively assess the quality of explainable AI algorithms as well as of certain inherently explainable DNNs in computer vision.
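
As one example of how such quantitative assessment can work, the sketch below implements a simple "deletion"-style check, a standard idea from the explainable-AI literature rather than the project's specific benchmark or datasets: the pixels an explanation marks as most important are progressively removed, and a faithful explanation should make the model's confidence drop quickly. The function deletion_curve and its arguments are hypothetical.

    # Sketch of a "deletion"-style check for feature attributions (a standard idea
    # from the XAI literature; not the project's specific benchmark or datasets).
    import torch
    import torch.nn as nn

    @torch.no_grad()
    def deletion_curve(model: nn.Module, image: torch.Tensor, attribution: torch.Tensor,
                       target_class: int, steps: int = 10) -> list:
        """Progressively zero out the most-attributed pixels and record how the
        predicted probability of `target_class` decreases; a faster drop suggests
        a more faithful attribution. `image` is (C, H, W), `attribution` is (H, W)."""
        order = attribution.flatten().argsort(descending=True)  # most important first
        masked = image.clone()
        chunk = max(1, order.numel() // steps)
        scores = []
        for i in range(steps):
            idx = order[i * chunk:(i + 1) * chunk]
            ys, xs = idx // image.shape[-1], idx % image.shape[-1]
            masked[:, ys, xs] = 0.0  # remove these pixels in all color channels
            prob = torch.softmax(model(masked.unsqueeze(0)), dim=-1)[0, target_class]
            scores.append(prob.item())
        return scores
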
The project has advanced the state of the art in robust, explainable deep learning approaches to computer vision in various directions. On concrete computer vision tasks, we have, for example, developed robust deep models that estimate the structure of a 3D scene as well as the 3D motion of the individual objects in it from monocular video. Moreover, we have devised new neural architectures that remove blur from digital images, as may arise, for example, from camera shake, and achieve significantly higher visual fidelity than before, including in scenarios that do not completely adhere to the assumptions used to construct the model. Another noteworthy result advances the state of the art in automatic image captioning, i.e. the task of automatically generating textual descriptions of images. Our novel models are more robust in the sense that they work more reliably on images that differ from those used for training; also, unlike previous work, our approach can generate not just a single “static” description of the scene but multiple, diverse textual descriptions that exhibit human-like variability.

In the direction of technical foundations, we have, for example, developed algorithms that estimate the uncertainty of both the predictions and the “inner workings” of a deep neural network in highly practical ways. We have also devised highly efficient algorithms for computing so-called feature attributions, which can be understood as visual maps that highlight to a human user which parts of the input image were primarily responsible for a certain prediction of a deep neural network. This makes estimating uncertainties and feature attributions much more practical than before, shedding significantly more light on otherwise rather opaque deep neural networks for a variety of image and video analysis tasks.
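
As a minimal illustration of the feature attribution concept (not the project's own, more efficient algorithms), the sketch below computes a plain input-gradient saliency map for a PyTorch classifier; the function name saliency_map is hypothetical.

    # Minimal sketch of a gradient-based feature attribution ("saliency map");
    # it illustrates the general concept only, not the project's attribution methods.
    import torch
    import torch.nn as nn

    def saliency_map(model: nn.Module, image: torch.Tensor, target_class: int) -> torch.Tensor:
        """Return a per-pixel map highlighting which regions of `image` (C, H, W)
        most influence the score of `target_class`, via the input-gradient magnitude."""
        model.eval()
        image = image.clone().requires_grad_(True)
        score = model(image.unsqueeze(0))[0, target_class]
        score.backward()
        # Aggregate over color channels to obtain a single (H, W) heat map.
        return image.grad.abs().max(dim=0).values

    # Usage with any image classifier, e.g. a pretrained torchvision ResNet:
    # from torchvision.models import resnet18, ResNet18_Weights
    # model = resnet18(weights=ResNet18_Weights.DEFAULT)
    # heat = saliency_map(model, preprocessed_image, target_class=281)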

In the remainder of the project, we will specifically aim to address more complex scene analysis scenarios, such as the joint reconstruction and semantic analysis of 3D scenes. Moreover, we will leverage our previous work on estimating uncertainties in deep neural networks to reduce the dependence of deep neural networks on large amounts of training data, for example by improving the effectiveness of transfer learning. Another direction will be to significantly expand the scope of our efforts toward an in-depth understanding of explainable AI techniques in computer vision. In a similar vein, we will also investigate how explainable AI techniques can be applied to much larger families of computer vision models, with the expectation of helping to significantly increase user trust in critical applications of computer vision.