Learning to See in a Dynamic World

Periodic Reporting for period 4 - SEED (Learning to See in a Dynamic World)

Reporting period: 2020-07-01 to 2021-12-31

The objective of SEED is to extract a description that identifies the objects contained in a video, their precise boundaries and spatial layout, and the manner in which those objects move, interact and change over time, based on weakly supervised large-scale machine learning techniques.

The goal of SEED is to fundamentally advance the methodology of computer vision by exploiting a dynamic and active observer analysis perspective, in order to acquire accurate yet tractable models that can automatically learn to sense the visual world, localize both still and animate objects (e.g. chairs, phones, computers, bicycles or cars, people and animals), and recognize actions and interactions between people, by propagating and consolidating temporal information, with minimal system training and supervision. For this purpose, SEED will develop novel high-order compositional methodologies for the semantic segmentation of video data acquired by observers of dynamic scenes, by adaptively integrating figure-ground reasoning based on bottom-up and top-down information, and by using weakly supervised machine learning techniques that support continuous learning towards an open-ended number of visual categories.

The methodology emerging from this research has the potential to impact fields as diverse as automatic personal assistance for people, video editing and indexing, robotics, environmental awareness, augmented reality, human-computer interaction, or manufacturing.
The achievements attained during this reporting period follow the general project plan and are as follows:

-- Semantic video segmentation. We developed models based on convolutional architectures and spatial transformer recurrent layers that are able to temporally propagate labeling information by means of optical flow, adaptively gated based on its locally estimated uncertainty. The flow, the recognition and the gated propagation modules can be trained jointly, end-to-end. The gated recurrent flow propagation component of our model can be plugged into any static semantic segmentation architecture and turn it into a weakly supervised video processing one.
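
The following is a minimal sketch of such a gated propagation module, written in PyTorch. The flow-to-grid conversion, the gate network and the fusion rule are illustrative assumptions, not the exact SEED architecture; the learned gate plays the role of the locally estimated flow uncertainty described above.

import torch
import torch.nn as nn
import torch.nn.functional as F

def flow_to_grid(flow):
    """Convert a dense flow field (B, 2, H, W) into a normalized sampling
    grid (B, H, W, 2) usable by F.grid_sample."""
    b, _, h, w = flow.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(flow.device)   # (2, H, W)
    coords = base.unsqueeze(0) + flow                              # absolute coordinates
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0                  # normalize to [-1, 1]
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    return torch.stack((coords_x, coords_y), dim=-1)               # (B, H, W, 2)

class GatedFlowPropagation(nn.Module):
    """Warp the previous frame's label features along the optical flow and
    fuse them with the current static prediction, gated by a learned
    confidence in the warped estimate (illustrative design)."""
    def __init__(self, channels):
        super().__init__()
        # gate conditioned on warped features, current features and the flow itself
        self.gate = nn.Sequential(
            nn.Conv2d(2 * channels + 2, channels, 3, padding=1),
            nn.Sigmoid(),
        )

    def forward(self, feat_prev, feat_curr, flow):
        warped = F.grid_sample(feat_prev, flow_to_grid(flow), align_corners=True)
        g = self.gate(torch.cat([warped, feat_curr, flow], dim=1))  # (B, C, H, W)
        return g * warped + (1.0 - g) * feat_curr                   # gated fusion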

-- Active visual search. One of the most widely used strategies for visual object detection is exhaustive spatial hypothesis search. While methods like sliding windows have been successful and effective for many years, they remain brute-force, independent of the image content and of the visual category being searched. In this line of work we developed principled sequential models that accumulate evidence collected at a small set of image locations in order to detect visual objects effectively. By formulating sequential search for visual object categories as deep reinforcement learning of the search policy (including the stopping condition) and the detector response function, our fully trainable model can explicitly balance, for each class, the conflicting goals of exploration (sampling more image regions for better accuracy) and exploitation (stopping the search efficiently when sufficiently confident about the target's location). The methodology is general and applicable to any detector response function.
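
The sketch below illustrates the sequential search idea under simplifying assumptions: per-region features are precomputed, the roll-out is greedy, and the policy, detector and stopping action are hypothetical stand-ins for the trained components described above.

import torch
import torch.nn as nn

class SearchPolicy(nn.Module):
    """Maps accumulated evidence to either a new region to fixate or STOP
    (illustrative policy; the project's trained policy differs)."""
    def __init__(self, feat_dim, num_regions):
        super().__init__()
        self.rnn = nn.GRUCell(feat_dim, 256)
        self.head = nn.Linear(256, num_regions + 1)   # last index = STOP action

    def forward(self, glimpse_feat, hidden):
        hidden = self.rnn(glimpse_feat, hidden)
        return self.head(hidden), hidden               # action logits, new state

def sequential_detect(image_feats, policy, detector, max_steps=10):
    """Greedy roll-out of the search policy.

    image_feats: (num_regions, feat_dim) precomputed per-region features;
                 policy must have been built with the same num_regions.
    detector:    callable scoring a region feature -> confidence in [0, 1].
    """
    hidden = torch.zeros(1, 256)
    glimpse = image_feats.mean(dim=0, keepdim=True)    # coarse initial evidence
    best_region, best_score = None, 0.0
    for _ in range(max_steps):
        logits, hidden = policy(glimpse, hidden)
        action = logits.argmax(dim=-1).item()
        if action == image_feats.shape[0]:             # STOP: confident enough
            break
        glimpse = image_feats[action:action + 1]       # fixate the chosen region
        score = detector(glimpse).item()
        if score > best_score:
            best_region, best_score = action, score
    return best_region, best_score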

-- Dynamic structured models for the detection, recognition (semantic segmentation) and 3d reconstruction of humans based on a multi-task architecture. We proposed a deep multi-task architecture for fully automatic 2d and 3d human sensing (DMHS), including recognition and reconstruction, in monocular images. The system computes the figure-ground segmentation, semantically identifies the human body parts at pixel level, and estimates the 2d and 3d pose of the person. The model supports the joint training of all components by means of multi-task losses, where early processing stages recursively feed into more advanced ones for increasingly complex computations, accuracy and robustness. The design allows us to assemble a complete training protocol by taking advantage of multiple datasets that would otherwise restrictively cover only some of the model components: complex 2d image data with no body-part labeling and without associated 3d ground truth, or complex 3d data with limited 2d background variability.
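
A minimal sketch of a multi-task loss of this kind is shown below. It assumes per-sample annotation availability, so that datasets with partial labels (2d-only or 3d-only) can be mixed; the task heads and weights are illustrative, not the actual DMHS choices.

import torch
import torch.nn.functional as F

def dmhs_style_loss(pred, target, weights=(1.0, 1.0, 1.0)):
    """pred/target are dicts with optional keys (illustrative interface):
       'parts'  : (B, C, H, W) body-part logits / (B, H, W) labels
       'pose2d' : (B, J, 2) image-plane joints
       'pose3d' : (B, J, 3) metric joints
    A task contributes only when its target is present for the sample, so
    partially annotated datasets can be combined in one training protocol."""
    w_seg, w_2d, w_3d = weights
    loss = torch.zeros((), device=next(iter(pred.values())).device)
    if target.get("parts") is not None:
        loss = loss + w_seg * F.cross_entropy(pred["parts"], target["parts"])
    if target.get("pose2d") is not None:
        loss = loss + w_2d * F.mse_loss(pred["pose2d"], target["pose2d"])
    if target.get("pose3d") is not None:
        loss = loss + w_3d * F.mse_loss(pred["pose3d"], target["pose3d"])
    return loss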

-- Large-scale weakly supervised kernel methods based on Fourier approximation. We developed methodologies that allow, for the first time, the application of non-linear, kernel-based semi-supervised learning methods (so far limited to datasets of only thousands of examples) to large-scale data repositories of millions of data points.
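
For illustration, the standard random Fourier feature construction (Rahimi and Recht) underlying this family of kernel approximations is sketched below; the project's specific semi-supervised formulation may differ, but the principle of replacing the kernel matrix by an explicit low-dimensional feature map is the same.

import numpy as np

def random_fourier_features(X, num_features=1024, gamma=1.0, seed=0):
    """Map X (n, d) to Z (n, D) such that Z @ Z.T approximates the Gaussian
    kernel exp(-gamma * ||x - y||^2). Linear models trained on Z then scale
    to millions of points, unlike exact kernel methods."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # frequencies sampled from the kernel's spectral density
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)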


-- Reconstruction of multiple interacting humans and of human interactions: Understanding 3d human interactions is fundamental for fine-grained scene analysis and behavioural modeling. However, most of the existing models focus on analyzing a single person in isolation, and those that process several people focus largely on resolving multi-person data association rather than on inferring interactions. This line of work addressed such issues and made several contributions: (1) we introduce models for interaction signature estimation (ISP) encompassing contact detection, segmentation, and 3d contact signature prediction; (2) we show how such components can be leveraged in order to produce augmented losses that ensure contact consistency during 3d reconstruction.
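
A minimal sketch of a contact-consistency term of this kind is given below, assuming vertex-level contact correspondences between two reconstructed body surfaces; the contact signatures used in the project operate over body-surface regions and the actual augmented losses are richer.

import torch

def contact_consistency_loss(verts_a, verts_b, contact_pairs):
    """Illustrative contact term: points predicted to be in contact on the two
    reconstructed surfaces are pulled together during 3d reconstruction.

    verts_a, verts_b: (Va, 3) and (Vb, 3) reconstructed surface vertices.
    contact_pairs:    (K, 2) long tensor of (index_in_a, index_in_b) pairs
                      predicted to be in contact."""
    pa = verts_a[contact_pairs[:, 0]]           # (K, 3)
    pb = verts_b[contact_pairs[:, 1]]           # (K, 3)
    return ((pa - pb) ** 2).sum(dim=-1).mean()  # mean squared contact gap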

-- Embodied visual active learning for semantic segmentation: We studied the task of embodied visual active learning, where an agent is set to explore a 3d environment with the goal of acquiring visual scene understanding by actively selecting views for which to request annotation. While accurate on some benchmarks, earlier deep visual recognition pipelines tend not to generalize well in certain real-world scenarios, or for unusual viewpoints. Robotic perception, in turn, requires the capability to refine the recognition capabilities for the conditions where the mobile system operates, including cluttered indoor environments or poor illumination. This motivates the proposed task, where an agent is placed in a novel environment with the objective of improving its visual recognition capability. To study embodied visual active learning, we develop a battery of agents, both learnt and pre-specified, with different levels of knowledge of the environment.
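
The loop below is a schematic sketch only, assuming a simplified interface (env, segmenter, agent); the actual agents, action spaces and simulator API used in this work differ.

def embodied_active_learning_episode(env, segmenter, agent, budget=10):
    """Hypothetical interface: the agent explores and occasionally spends
    annotation budget on the current view; the segmenter is refined on every
    annotated view, improving recognition in the environment it operates in."""
    obs = env.reset()
    annotated = []
    while budget > 0:
        action = agent.act(obs, segmenter)      # move, look around, or ANNOTATE
        if action == "ANNOTATE":
            label = env.request_annotation()    # ground-truth mask for this view
            annotated.append((obs, label))
            segmenter.finetune(annotated)       # refine recognition in situ
            budget -= 1
        obs = env.step(action)
    return segmenter
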
Progress beyond the state of the art has been achieved in the following areas:

-- Weakly supervised semantic video segmentation, CVPR 2018

-- Deep reinforcement learning of region proposal networks for object detection, CVPR 2018

-- Deep learning of graph matching, CVPR 2018

-- Appearance transfer, CVPR 2018

-- Embodied Active Learning, AAAI 2021

-- 3d human pose reconstruction of multiple people in monocular images and video, CVPR 2018

-- 3d reconstruction of human interactions, including self-contact, CVPR 2020 and CVPR 2021

-- Generating Scenarios with Diverse Pedestrian Behaviors for Autonomous Vehicle Testing, CoRL 2021
Figures: Reinforcement learning for object detection; Weakly supervised semantic video segmentation.