
Expectational Visual Artificial Intelligence

Periodic Reporting for period 2 - EVA (Expectational Visual Artificial Intelligence)

Reporting period: 2022-06-01 to 2023-11-30

- Regarding the team

The project started with a delay of 9-12 months, agreed with the project managers because of COVID-related hiring problems. The research team comprises Haochen Wang, Ilze Amanda Auzina, and Leonard Bereska; a postdoctoral researcher will be hired next (the vacancy closes in September 2023).

- What is the problem/issue being addressed?

Most machine learning applications have focused on static problems, e.g. classification or regression. However, the majority of real-world problems are dynamic, such as videos recording everyday activities and events, or videos of scientific recordings. Hence, the project's overall objective is to learn true dynamics from observational data.


- Why is it important for society?

Learning from data is becoming increasingly relevant as AI is applied to climate, molecular dynamics, and robotic models. Moreover, static data, and the models trained on them, are generally easier to control: they correspond to much smaller amounts of data, and a near-exhaustive examination of model behavior is feasible. As we move to dynamic data, however, the same models are no longer a good fit, because the assumptions made for static data (a limited number of appearance variations, a limited number of correlation patterns, near-stationarity) are not guaranteed. This means not only that existing models in the literature will not work well, but also that it will be harder to ensure their safety and reliability. Since in recent years there has been growing concern for AI safety at both the scientific and societal (EU, world) level, it is important to have reliable models for dynamic data, which comprise the vast majority of data in real life.

- What are the overall objectives?

The objectives of this project can be divided into temporal machine learning objectives, temporal computer vision objectives, and temporal AI safety objectives.

Regarding temporal machine learning objectives, we will explore learning dynamics (i) from a Bayesian perspective with informative priors; (ii) with continuous latent models that account for time-invariant information; and (iii) as a mixture of non-linear dynamics with transformer-inspired dynamics slots. Regarding temporal computer vision objectives, with consumer videos recording events and activities, understanding everything that happens in a video is critical: from recognizing objects and their instances, to actions, to events (complex actions), to interactions (actions between objects). Crucial to this is Video Instance Segmentation (VIS), which aims at segmenting and categorizing objects in videos from a closed set of training categories but lacks the generalization ability to handle novel categories in real-world videos. In terms of temporal AI safety objectives, to ensure control, the overall objectives include mechanistic interpretability and the emergence of goal formation.
Regarding temporal machine learning, we first explored learning dynamics from a Bayesian approach with a Gaussian Process (GP) model. We proposed a new generative model in which the differential equation is modelled in a latent space with a GP, whose kernel function captures a prior bias towards the true dynamics. The proposed method outperformed related models in the field and opened a new modelling approach. Subsequently, we designed a framework, modulated NODEs, that disentangles time-variant from time-invariant factors and leads to improved generalization to new dynamics as well as improved forecasting capabilities. Currently we are working on learning switching non-linear dynamics from high-dimensional data with transformers. A preliminary version of the work, 'Latent GP-ODEs with Informative Priors', was published in a NeurIPS 2022 workshop, and the full version is currently under review at NeurIPS 2023.
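The core idea of modelling a differential equation with a GP can be illustrated in a few lines: draw a vector field from a GP whose kernel encodes a smoothness prior, then integrate it. This is only a minimal sketch of the concept, not the project's implementation; all function names, the RBF kernel choice, and the Euler integrator are illustrative assumptions.

```python
import numpy as np

def rbf_kernel(X, Y, lengthscale=1.0):
    # Squared-exponential kernel: encodes a smoothness prior on the dynamics
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * d2 / lengthscale ** 2)

def sample_gp_vector_field(grid, rng, lengthscale=1.0):
    # Draw one GP sample per state dimension, defining dx/dt = f(x) at grid points
    K = rbf_kernel(grid, grid, lengthscale) + 1e-5 * np.eye(len(grid))
    L = np.linalg.cholesky(K)
    return L @ rng.standard_normal((len(grid), grid.shape[1]))

def euler_rollout(x0, grid, field, steps=50, dt=0.05, lengthscale=1.0):
    # Interpolate the sampled field by kernel regression, then integrate with Euler steps
    K = rbf_kernel(grid, grid, lengthscale) + 1e-5 * np.eye(len(grid))
    alpha = np.linalg.solve(K, field)
    traj = [x0]
    for _ in range(steps):
        f = rbf_kernel(traj[-1][None, :], grid, lengthscale) @ alpha
        traj.append(traj[-1] + dt * f[0])
    return np.stack(traj)

rng = np.random.default_rng(0)
grid = rng.uniform(-2, 2, size=(30, 2))   # inducing points in a 2-D latent space
field = sample_gp_vector_field(grid, rng)
traj = euler_rollout(np.zeros(2), grid, field)
print(traj.shape)  # (51, 2)
```

Changing the kernel (e.g. its lengthscale, or a periodic kernel) is exactly where an informative prior about the true dynamics would enter.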

Regarding temporal computer vision, we first collected videos containing objects from 384 categories to build a large-vocabulary video instance segmentation dataset (LVVIS) with an extremely large category size of 1,212. We tried several different methods to tackle our proposed Open-Vocabulary Video Instance Segmentation task and achieved promising performance with MindVLT. We first tried a straightforward approach to Open-Vocabulary VIS: associating per-frame results of open-vocabulary detectors with open-world trackers. However, this propose-reduce-associate approach requires intricate hand-crafted modules such as non-maximum suppression, and neglects video-level features needed for stable tracking and open-vocabulary classification in videos. To improve speed and accuracy, we proposed the first end-to-end Open-Vocabulary VIS model, MindVLT, which simplifies the intricate propose-reduce-associate paradigm and attains long-term awareness with a Memory-Induced Vision-Language Transformer. Specifically, it starts by proposing and segmenting all objects with a Universal Object Proposal module; then a set of Memory Queries is introduced to incrementally encode object features through time, enabling long-term awareness for efficiently tracking all objects. Lastly, given arbitrary category names as input, a language transformer encoder classifies the tracked objects based on the Memory Queries. The Memory Queries aggregate object features from different frames, leading to robust video object classification. To our knowledge, MindVLT is the first end-to-end model capable of segmenting, tracking, and classifying objects in videos from arbitrary open-set categories at near real-time inference speed. This work was accepted for an oral presentation at ICCV 2023.
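The memory-query mechanism described above can be sketched in miniature: keep one running embedding per tracked object, update it incrementally each frame, and classify it against text embeddings of arbitrary category names. This toy stand-in assumes an exponential-moving-average update and cosine-similarity classification; the actual MindVLT uses learned transformer attention, so every name and update rule here is an illustrative simplification.

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two feature vectors
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)

class MemoryQueries:
    """Toy stand-in for memory queries: one running embedding per tracked
    object, updated incrementally frame by frame."""
    def __init__(self, momentum=0.8):
        self.momentum = momentum
        self.queries = {}  # track id -> aggregated feature over all frames so far

    def update(self, object_feats):
        # object_feats: {track_id: per-frame feature vector}
        for oid, feat in object_feats.items():
            if oid in self.queries:
                self.queries[oid] = (self.momentum * self.queries[oid]
                                     + (1 - self.momentum) * feat)
            else:
                self.queries[oid] = feat.copy()

    def classify(self, text_embeddings):
        # text_embeddings: {category name: embedding}; categories are open-set,
        # so any names can be supplied at inference time
        return {oid: max(text_embeddings,
                         key=lambda c: cosine(q, text_embeddings[c]))
                for oid, q in self.queries.items()}

rng = np.random.default_rng(1)
cat_emb = {"cat": rng.standard_normal(8), "skateboard": rng.standard_normal(8)}
mem = MemoryQueries()
for _ in range(5):  # five frames: track 0 resembles "cat" plus per-frame noise
    mem.update({0: cat_emb["cat"] + 0.1 * rng.standard_normal(8)})
print(mem.classify(cat_emb))  # {0: 'cat'}
```

Aggregating over frames is what makes the classification robust: a single noisy frame matters less than the accumulated evidence in the query.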

Regarding temporal AI safety, we first focused on continual learning of dynamical systems, which we researched using Competitive Federated Reservoir Computing. The research resulted in a published paper, "Continual Learning of Dynamical Systems with Competitive Federated Reservoir Computing," which introduces a novel approach to continual learning based on reservoir computing and competitive prediction heads. The results demonstrated the approach's effectiveness in minimizing interference and catastrophic forgetting across various dynamical systems.
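The competitive-prediction-heads idea can be sketched as follows: a fixed random reservoir turns an input series into rich states, each head trains a ridge-regression readout on one dynamical system, and at inference the head with the lowest prediction error "wins" the sequence, so learning a new system need not overwrite the others. This is a minimal illustration under assumed hyperparameters (reservoir size, spectral radius, regularization), not the paper's federated setup.

```python
import numpy as np

def reservoir_states(inputs, n_res=100, seed=0):
    # Fixed random echo-state reservoir: only the readouts are ever trained
    rng = np.random.default_rng(seed)
    W_in = rng.uniform(-0.5, 0.5, (n_res, 1))
    W = rng.uniform(-0.5, 0.5, (n_res, n_res))
    W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius < 1
    h = np.zeros(n_res)
    states = []
    for u in inputs:
        h = np.tanh(W @ h + W_in @ np.atleast_1d(u))
        states.append(h.copy())
    return np.array(states)

def train_head(series):
    # Ridge-regression readout predicting the next value of one dynamical system
    H = reservoir_states(series[:-1])
    y = series[1:]
    return np.linalg.solve(H.T @ H + 1e-4 * np.eye(H.shape[1]), H.T @ y)

def competitive_predict(series, heads):
    # Each head predicts the sequence; the lowest-error head wins the competition
    H = reservoir_states(series[:-1])
    errors = [np.mean((H @ w - series[1:]) ** 2) for w in heads]
    return int(np.argmin(errors))

t = np.linspace(0, 8 * np.pi, 400)
sine, sawtooth = np.sin(t), (t % (2 * np.pi)) / np.pi - 1
heads = [train_head(sine), train_head(sawtooth)]
print(competitive_predict(sine[:100], heads))      # expected 0: sine head wins
print(competitive_predict(sawtooth[:100], heads))  # expected 1: sawtooth head wins
```

Because each system is handled by its own readout over a shared frozen reservoir, training a head for a new system does not interfere with the others, which is the mechanism that limits catastrophic forgetting.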
Regarding the temporal machine learning work, all the proposed models have outperformed the existing related models in the field, Neural ODEs and their improved variants. More importantly, they introduced a new modelling perspective for dynamical systems. The expected results by the project's end are to extend the current work to transformer-based architectures and to broaden the application domain.

Regarding temporal computer vision, the proposed methods have achieved clear improvements over the state of the art, including FEELVOS, MaskTrack, SipMask, Mask2Former, and DETIC. The next step is to include temporal interactivity and to connect the methods with Foundation and Large Language Models.

Regarding temporal AI safety, by the project's end we plan to have developed novel interpretability techniques. The goal is to enhance the interpretability of transformer architectures trained with reinforcement learning. These techniques will go beyond existing methods, providing deeper insights into the decision-making processes of these models. Particularly important is the discovery of divergent goals: the project seeks to advance the uncovering and understanding of divergent goals exhibited by reinforcement-learning-trained transformers.