Periodic Reporting for period 3 - DEXIM (Deeply Explainable Intelligent Machines)
Reporting period: 2023-07-01 to 2023-12-31
As decision makers, humans can justify their decisions in natural language and point to the evidence in the visual world that led to those decisions. In contrast, artificially intelligent systems are frequently opaque and unable to explain their decisions. This is particularly concerning because such systems ultimately fail to build trust with human users.
In this project, the goal is to build a fully transparent, end-to-end trainable and explainable deep learning approach for visual scene understanding. To achieve this goal, we make use of the positive interactions between multiple data modalities and incorporate uncertainty and temporal continuity constraints as well as memory mechanisms. The output of this project can have direct consequences for many practical applications, most notably in mobile robotics and the intelligent vehicle industry, and will therefore strengthen user trust in a very competitive market. In particular, we address the following three research questions:
1. How can we build interpretable decision agents that are capable of explaining their decisions through natural language for a variety of vision tasks?
2. How can we develop such an approach with reduced external supervision?
3. How can we enable human-interpretable communication between multiple agents that work together to solve a common objective?
We study these research challenges by combining computer vision, natural language processing and machine learning, integrating multiple modalities, e.g. vision and text, to develop end-to-end deep explainable AI systems. In this reporting period, we investigated three core computer vision, machine learning and natural language processing tasks that are essential for generating explanations.
The first task concerned generating introspective explanations via visual attention. Generic visual attention constrains the reasons for a decision but does not tie specific actions to specific input regions; the attention mechanism we developed in [a] provides the region the model is looking at while pointing to the attributes and noun phrases that this region corresponds to. In addition to generating visual explanations in the form of attention, in [b] we developed a benchmark and a dataset and proposed an improved method for generating text-based explanations in the context of visual question answering.
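To make the idea concrete, the following is a minimal, illustrative sketch of attention-weighted classification in PyTorch. It is not the CALM method of [a]; the module names and dimensions are hypothetical. The point is that the attention map used to pool the features is itself the introspective explanation, localizing the evidence behind the prediction.

    # Minimal sketch of an introspective explanation via visual attention.
    # Illustrative only (not the method of [a]); names/dims are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionClassifier(nn.Module):
        def __init__(self, in_channels=512, num_classes=10):
            super().__init__()
            self.attn = nn.Conv2d(in_channels, 1, kernel_size=1)  # spatial attention logits
            self.classifier = nn.Linear(in_channels, num_classes)

        def forward(self, feats):
            # feats: (B, C, H, W) feature map from a CNN backbone
            b, c, h, w = feats.shape
            attn = F.softmax(self.attn(feats).view(b, -1), dim=1)  # (B, H*W), sums to 1
            # Attention-weighted pooling: the decision is tied to the regions
            # the model attends to.
            pooled = torch.bmm(feats.view(b, c, -1), attn.unsqueeze(2)).squeeze(2)
            logits = self.classifier(pooled)
            # The attention map is the explanation: it localizes the input
            # regions that contributed to the decision.
            return logits, attn.view(b, h, w)

    feats = torch.randn(2, 512, 7, 7)  # e.g. final conv features of a ResNet
    logits, attention_map = AttentionClassifier()(feats)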
For the second task of learning with reduced supervision, we constructed recognition models for unseen target classes that have not been labeled for training. The methods we developed here operate in a limited-data regime and associate observed and unobserved classes through some form of auxiliary information which encodes visually distinguishing properties of objects [c,d,e]. By learning these encodings, a decision maker also learns representations that are transferable across decisions and tasks. Our recent collaborative publication [f] applied the methods developed in this sub-category to large-scale insect identification, a very important task in environmental preservation.
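The core mechanism can be sketched as a compatibility model between image features and class-level auxiliary information (e.g. attribute vectors). The sketch below is illustrative rather than any specific method from [c,d,e]; all names, dimensions and data are hypothetical. Because the model scores classes through their attribute descriptions, it can classify unseen classes at test time simply by swapping in their attribute vectors.

    # Minimal sketch of zero-shot recognition via auxiliary information.
    # Illustrative only (the methods in [c,d,e] are more elaborate).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CompatibilityModel(nn.Module):
        def __init__(self, feat_dim=2048, attr_dim=85):
            super().__init__()
            # W maps image features into the attribute (side-information) space
            self.W = nn.Linear(feat_dim, attr_dim, bias=False)

        def forward(self, img_feats, class_attrs):
            # img_feats: (B, feat_dim); class_attrs: (K, attr_dim), one row per class.
            # Compatibility score s(x, y) = (W x)^T a_y for every class y.
            return self.W(img_feats) @ class_attrs.t()  # (B, K)

    model = CompatibilityModel()
    seen_attrs = torch.rand(40, 85)    # attribute vectors of seen (training) classes
    unseen_attrs = torch.rand(10, 85)  # attributes of classes never labeled in training

    x = torch.randn(4, 2048)
    # Training uses only seen classes:
    loss = F.cross_entropy(model(x, seen_attrs), torch.randint(0, 40, (4,)))
    # At test time, the same model classifies unseen classes through their
    # attribute descriptions -- no unseen images were ever labeled:
    pred_unseen = model(x, unseen_attrs).argmax(dim=1)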
Finally, to enable human-interpretable communication between multiple agents that work together to solve a common objective, we consider settings in which a set of agents must cooperate to complete a task while communicating through a noisy channel of limited bandwidth. Incorporating pragmatics, i.e. reasoning about the contextual information behind a message, into both learning and inference can improve performance on this family of tasks. In this project so far, we have explored and developed emergent communication protocols based on language [g] and abstract primitives [h] in multi-agent settings [i].
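As an illustration of the setting, the sketch below shows a speaker agent compressing its observation into a short discrete message that a listener must act on; the bandwidth limit is the message length and vocabulary size, the noise is a perturbation on the channel, and Gumbel-softmax keeps the discrete channel differentiable so both agents train end-to-end. This is a generic toy setup, not the exact protocols of [g,h,i]; all names and sizes are hypothetical.

    # Minimal sketch of emergent communication over a noisy, limited-bandwidth
    # channel. Illustrative only; not the protocols of [g,h,i].
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, MSG_LEN = 8, 3  # bandwidth limit: 3 symbols from an 8-word vocabulary

    class Speaker(nn.Module):
        def __init__(self, obs_dim=32):
            super().__init__()
            self.to_msg = nn.Linear(obs_dim, MSG_LEN * VOCAB)

        def forward(self, obs, noise=0.1):
            logits = self.to_msg(obs).view(-1, MSG_LEN, VOCAB)
            # Noisy channel modeled as a perturbation of the message logits;
            # Gumbel-softmax (hard=True) yields discrete one-hot symbols while
            # remaining differentiable for end-to-end training.
            msg = F.gumbel_softmax(logits + noise * torch.randn_like(logits),
                                   tau=1.0, hard=True)
            return msg.flatten(1)  # (B, MSG_LEN * VOCAB)

    class Listener(nn.Module):
        def __init__(self, num_actions=5):
            super().__init__()
            self.policy = nn.Linear(MSG_LEN * VOCAB, num_actions)

        def forward(self, msg):
            return self.policy(msg)

    speaker, listener = Speaker(), Listener()
    obs = torch.randn(16, 32)
    target_action = torch.randint(0, 5, (16,))
    # Shared objective: the listener must pick the right action from the
    # message alone, so a communication protocol emerges during training.
    loss = F.cross_entropy(listener(speaker(obs)), target_action)
    loss.backward()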
References:
[a] Keep CALM and Improve Visual Feature Attribution, Kim et al., ICCV 2021
[b] e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks, Kayser et al., ICCV 2021
[c] Learning Graph Embeddings for Open-World Compositional Zero-Shot Learning, Mancini et al., IEEE TPAMI 2022
[d] Fine-Grained Zero-Shot Learning with DNA as Side Information, Badirli et al., NeurIPS 2021
[e] Audio-Visual Generalized Zero-Shot Learning with Cross-Modal Attention and Language, Mercea et al., CVPR 2022
[f] Classifying the Unknown: Insect Identification with Deep Hierarchical Bayesian Learning, Badirli et al., Methods in Ecology and Evolution 2023
[g] Learning Decision Trees Recurrently Through Communication, Alaniz et al., CVPR 2021
[h] Abstracting Sketches through Simple Primitives, Alaniz et al., ECCV 2022
[i] Modeling Conceptual Understanding in Image Reference Games, Corona et al., NeurIPS 2019