Periodic Reporting for period 3 - DEXIM (Deeply Explainable Intelligent Machines)
Reporting period: 2023-07-01 to 2023-12-31
As decision makers, humans can justify their decisions in natural language and point to the evidence in the visual world that led to those decisions. In contrast, artificially intelligent systems are frequently opaque and unable to explain their decisions. This is particularly concerning because such systems ultimately fail to build trust with human users.
In this project, the goal is to build a fully transparent, end-to-end trainable and explainable deep learning approach for visual scene understanding. To achieve this goal, we make use of the positive interactions between multiple data modalities and incorporate uncertainty and temporal continuity constraints as well as memory mechanisms. The output of this project can have direct consequences for many practical applications, most notably in mobile robotics and the intelligent vehicle industry, and will therefore strengthen user trust in a very competitive market. In particular, we address the following three research questions:
1. How can we build interpretable decision agents that are capable of explaining their decisions through natural language for a variety of vision tasks?
2. How can we develop such an approach with reduced external supervision?
3. How can we enable human-interpretable communication between multiple agents that work together to solve a common objective?
We study these research challenges by combining computer vision, natural language processing and machine learning, integrating multiple modalities, e.g. vision and text, to develop end-to-end deep explainable AI systems. In this reporting period, we investigated three core computer vision, machine learning and natural language processing tasks that are essential for generating explanations.
The first task concerned generating introspective explanations via visual attention. Generic visual attention constrains the reasons for a decision but does not tie specific actions to specific input regions; the attention mechanism we developed in [a] provides the region the model is looking at while pointing to the attributes and noun phrases that this region corresponds to. In addition to generating visual explanations in the form of attention, in [b] we developed a benchmark and a dataset and proposed an improved method for generating text-based explanations in the context of visual question answering.
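To make the idea concrete, the following is a minimal, illustrative sketch of attention-weighted classification in PyTorch. It is not the CALM method of [a]; the module names and dimensions are hypothetical. The point is that the attention map used to pool the features is itself the introspective explanation, localizing the evidence behind the prediction.

    # Minimal sketch of an introspective explanation via visual attention.
    # Illustrative only (not the method of [a]); names/dims are hypothetical.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionClassifier(nn.Module):
        def __init__(self, in_channels=512, num_classes=10):
            super().__init__()
            self.attn = nn.Conv2d(in_channels, 1, kernel_size=1)  # spatial attention logits
            self.classifier = nn.Linear(in_channels, num_classes)

        def forward(self, feats):
            # feats: (B, C, H, W) feature map from a CNN backbone
            b, c, h, w = feats.shape
            attn = F.softmax(self.attn(feats).view(b, -1), dim=1)  # (B, H*W), sums to 1
            # Attention-weighted pooling: the decision is tied to the regions
            # the model attends to.
            pooled = torch.bmm(feats.view(b, c, -1), attn.unsqueeze(2)).squeeze(2)
            logits = self.classifier(pooled)
            # The attention map is the explanation: it localizes the input
            # regions that contributed to the decision.
            return logits, attn.view(b, h, w)

    feats = torch.randn(2, 512, 7, 7)  # e.g. final conv features of a ResNet
    logits, attention_map = AttentionClassifier()(feats)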
For the second task of learning with reduced supervision, we constructed recognition models for unseen target classes that have not been labeled for training. The methods we developed here operate in a limited-data regime and associate observed and unobserved classes through some form of auxiliary information which encodes visually distinguishing properties of objects [c,d,e]. By learning these encodings, a decision maker also learns representations that are transferable across decisions and tasks. Our recent collaborative publication [f] applied the methods developed in this sub-category to large-scale insect identification, a very important task in environmental preservation.
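The core mechanism can be sketched as a compatibility model between image features and class-level auxiliary information (e.g. attribute vectors). The sketch below is illustrative rather than any specific method from [c,d,e]; all names, dimensions and data are hypothetical. Because the model scores classes through their attribute descriptions, it can classify unseen classes at test time simply by swapping in their attribute vectors.

    # Minimal sketch of zero-shot recognition via auxiliary information.
    # Illustrative only (the methods in [c,d,e] are more elaborate).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CompatibilityModel(nn.Module):
        def __init__(self, feat_dim=2048, attr_dim=85):
            super().__init__()
            # W maps image features into the attribute (side-information) space
            self.W = nn.Linear(feat_dim, attr_dim, bias=False)

        def forward(self, img_feats, class_attrs):
            # img_feats: (B, feat_dim); class_attrs: (K, attr_dim), one row per class.
            # Compatibility score s(x, y) = (W x)^T a_y for every class y.
            return self.W(img_feats) @ class_attrs.t()  # (B, K)

    model = CompatibilityModel()
    seen_attrs = torch.rand(40, 85)    # attribute vectors of seen (training) classes
    unseen_attrs = torch.rand(10, 85)  # attributes of classes never labeled in training

    x = torch.randn(4, 2048)
    # Training uses only seen classes:
    loss = F.cross_entropy(model(x, seen_attrs), torch.randint(0, 40, (4,)))
    # At test time, the same model classifies unseen classes through their
    # attribute descriptions -- no unseen images were ever labeled:
    pred_unseen = model(x, unseen_attrs).argmax(dim=1)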
Finally, to enable human-interpretable communication between multiple agents that work together to solve a common objective, we consider settings in which a set of agents must cooperate to complete a task while communicating through a noisy channel of limited bandwidth. Incorporating pragmatics, i.e. reasoning about the contextual information behind a message, into both learning and inference can improve performance on this family of tasks. In this project so far, we have explored and developed emergent communication protocols based on language [g] and abstract primitives [h] in multi-agent settings [i].
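As an illustration of the setting, the sketch below shows a speaker agent compressing its observation into a short discrete message that a listener must act on; the bandwidth limit is the message length and vocabulary size, the noise is a perturbation on the channel, and Gumbel-softmax keeps the discrete channel differentiable so both agents train end-to-end. This is a generic toy setup, not the exact protocols of [g,h,i]; all names and sizes are hypothetical.

    # Minimal sketch of emergent communication over a noisy, limited-bandwidth
    # channel. Illustrative only; not the protocols of [g,h,i].
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    VOCAB, MSG_LEN = 8, 3  # bandwidth limit: 3 symbols from an 8-word vocabulary

    class Speaker(nn.Module):
        def __init__(self, obs_dim=32):
            super().__init__()
            self.to_msg = nn.Linear(obs_dim, MSG_LEN * VOCAB)

        def forward(self, obs, noise=0.1):
            logits = self.to_msg(obs).view(-1, MSG_LEN, VOCAB)
            # Noisy channel modeled as a perturbation of the message logits;
            # Gumbel-softmax (hard=True) yields discrete one-hot symbols while
            # remaining differentiable for end-to-end training.
            msg = F.gumbel_softmax(logits + noise * torch.randn_like(logits),
                                   tau=1.0, hard=True)
            return msg.flatten(1)  # (B, MSG_LEN * VOCAB)

    class Listener(nn.Module):
        def __init__(self, num_actions=5):
            super().__init__()
            self.policy = nn.Linear(MSG_LEN * VOCAB, num_actions)

        def forward(self, msg):
            return self.policy(msg)

    speaker, listener = Speaker(), Listener()
    obs = torch.randn(16, 32)
    target_action = torch.randint(0, 5, (16,))
    # Shared objective: the listener must pick the right action from the
    # message alone, so a communication protocol emerges during training.
    loss = F.cross_entropy(listener(speaker(obs)), target_action)
    loss.backward()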
References:
[a] Keep CALM and Improve Visual Feature Attribution, Kim et al., ICCV 2021
[b] e-ViL: A Dataset and Benchmark for Natural Language Explanations in Vision-Language Tasks, Kayser et al., ICCV 2021
[c] Learning Graph Embeddings for Open-World Compositional Zero-Shot Learning, Mancini et al., IEEE TPAMI 2022
[d] Fine-Grained Zero-Shot Learning with DNA as Side Information, Badirli et al., NeurIPS 2021
[e] Audio-Visual Generalized Zero-Shot Learning with Cross-Modal Attention and Language, Mercea et al., CVPR 2022
[f] Classifying the Unknown: Insect Identification with Deep Hierarchical Bayesian Learning, Badirli et al., Methods in Ecology and Evolution 2023
[g] Learning Decision Trees Recurrently Through Communication, Alaniz et al., CVPR 2021
[h] Abstracting Sketches through Simple Primitives, Alaniz et al., ECCV 2022
[i] Modeling Conceptual Understanding in Image Reference Games, Corona et al., NeurIPS 2019