
Deep Learning for Dynamic 3D Visual Scene Understanding

Periodic Reporting for period 2 - DeeViSe (Deep Learning for Dynamic 3D Visual Scene Understanding)

Reporting period: 2019-10-01 to 2021-03-31

Visual scene understanding is one of the fundamental capabilities humans possess. Perceiving the 3D world around us, recognizing objects and people that populate it, and interpreting their motion is of prime importance for our daily lives. Consequently, it is also an extremely important capability for the development of intelligent machines, such as future service robots or autonomous vehicles. If such technical systems are to move through environments built for humans, they need to be aware of their surroundings and act accordingly.

Developing vision systems with advanced scene understanding capabilities has been a goal of computer vision research from its beginnings. The task is challenging, since it comprises many different aspects, such as recognition, segmentation, tracking, human body pose estimation, and 3D reconstruction. In the past few years, deep learning has fundamentally changed the way these tasks are approached in computer vision, with deep neural networks (DNNs) taking a central role in practically all areas of vision. Nevertheless, current deep learning approaches for vision are still limited in that they most often address different vision tasks in isolation, while relying on large amounts of manually annotated training data.

The goal of the DeeViSe project is to develop novel end-to-end deep learning approaches for dynamic visual scene understanding that break the boundaries between the different vision tasks in order to create vision systems that combine multiple vision modalities within a common deep learning framework. In addition, DeeViSe aims to impart deep learning approaches with a notion of what it means to move through a 3D world by incorporating temporal continuity constraints, as well as a persistence of 3D structure through associative and spatial neural memory mechanisms. Finally, DeeViSe addresses the question of how deep neural networks can be successfully trained with reduced (external) supervision by investigating weakly supervised and self-supervised learning mechanisms, as well as more scalable workflows for human-assisted training data generation.

In order to study these research questions, the DeeViSe project investigates four core computer vision tasks that are essential for scene understanding: Pixel-level Semantic Scene Analysis, combining object recognition and segmentation (WP1); Deep Object Tracking (WP2); Dynamic Human Pose Analysis (WP3); and Deep Representations for 3D Reconstruction (WP4). These tasks are supported by activities on Learning with Reduced Supervision (WP5) and Dataset Curation and Annotation (WP6).
Work in WP1 focused on developing end-to-end deep learning architectures for temporally consistent, pixel-level analysis of dynamic scenes. In particular, this research goal envisions a fusion of the previously separate areas of object detection, segmentation, and tracking. We were able to significantly advance this research direction under the label "Video Object Segmentation" (VOS) through a series of high-impact publications that advanced the state-of-the-art in VOS, including OnAVOS [BMVC'17], PReMVOS [ACCV'18], FeelVOS [CVPR'19], UnOVOST [WACV'20] and STEm-Seg [ECCV'20].

WP2 aims at building end-to-end deep learning approaches for single- and multi-object tracking. We made good progress towards this goal and developed state-of-the-art approaches for single- and multi-object tracking (SIAM-RCNN [CVPR'20], T2R-R2T [ICRA'20]). Moreover, we defined the task of "Multi-Object Tracking and Segmentation" (MOTS) [CVPR'19] and established it together with an annotated benchmark dataset and evaluation methodology. To popularize this task, we co-organized the MOTS-Challenge workshop at CVPR'20. We additionally developed the novel tracking evaluation methodology Higher-Order Tracking Accuracy (HOTA) [IJCV'20].
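
Conceptually, the HOTA metric balances detection and association quality rather than emphasizing one of them, as earlier metrics did. In simplified form (following the decomposition in the IJCV'20 paper, at a single localization threshold α; TPA, FNA, FPA denote the per-detection association counts defined there, and the final score averages over a range of α values):

```latex
% HOTA at localization threshold \alpha is the geometric mean of a
% detection term (DetA) and an association term (AssA).
\mathrm{HOTA}_{\alpha} = \sqrt{\mathrm{DetA}_{\alpha}\cdot\mathrm{AssA}_{\alpha}},
\qquad
\mathrm{DetA}_{\alpha} = \frac{|\mathrm{TP}|}{|\mathrm{TP}|+|\mathrm{FN}|+|\mathrm{FP}|},
\qquad
\mathrm{AssA}_{\alpha} = \frac{1}{|\mathrm{TP}|}\sum_{c\in\mathrm{TP}}
  \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)|+|\mathrm{FNA}(c)|+|\mathrm{FPA}(c)|}
```

The geometric mean ensures that a tracker cannot obtain a high HOTA score by excelling at detection while neglecting association, or vice versa.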

WP3 focuses on developing deep learning approaches for analyzing the body poses, activities, and interactions of humans. Here, we developed a lightweight but very effective approach for 3D body pose estimation, MeTRo [FG'20]. We further extended this approach to also estimate the absolute 3D position and body pose of detected people in a multi-person setting (MeTRAbs) [IEEE TBIOM'21].

The goal of WP4 is to find ways to interface the geometry-centric representations from traditional Computer Vision methods with the semantic analysis capabilities of deep learning based vision modules. For this, we worked on developing deep learning representations for 3D point clouds and 3D geometry, resulting in state-of-the-art approaches for 3D semantic segmentation (DualConvMesh-Net [CVPR'20], Dilated Point Convolutions [ICRA'20]) and 3D instance segmentation (3D-BEVIS [GCPR'19], 3D-MPA [CVPR'20]).
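
The core idea behind Dilated Point Convolutions [ICRA'20] is to enlarge the receptive field of a point convolution without aggregating more points: compute the k·d nearest neighbors and keep only every d-th of them. A minimal sketch of this neighbor selection (function names here are illustrative, not the paper's API):

```python
import numpy as np

def dilated_knn(points, query_idx, k, d):
    """Indices of a dilated k-NN neighborhood for one query point.

    Instead of the k nearest neighbors, the (k * d) nearest are computed
    and every d-th one is kept. This widens the spatial extent of the
    neighborhood while the convolution still aggregates only k points.
    (Illustrative sketch of the idea behind Dilated Point Convolutions.)
    """
    dists = np.linalg.norm(points - points[query_idx], axis=1)
    nearest = np.argsort(dists)[: k * d]  # k*d nearest, query itself first
    return nearest[::d]                   # keep every d-th -> k neighbors

# Toy usage: 100 random 3D points, 4 neighbors with dilation factor 2.
rng = np.random.default_rng(0)
pts = rng.standard_normal((100, 3))
neigh = dilated_knn(pts, query_idx=0, k=4, d=2)
```

With d = 1 this reduces to an ordinary k-NN neighborhood, so dilation is a drop-in change to existing point convolution layers.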

WP5 addresses the problem of learning deep networks with reduced human supervision. Towards this goal, we 1) developed automated tools for interactive, human-assisted image segmentation [BMVC'18], as well as an automated workflow for human-assisted segmentation annotation of entire video datasets [WACV'21]; 2) developed approaches for category-agnostic multi-object tracking (CAMOT [ICRA'18], 4D-GVT [ICRA'20]); 3) performed automatic self-supervised object track mining and object discovery in large video corpora (10+ hours) [ICRA'19].

Work in WP6 focused on creating novel and detailed segmentation annotations for existing (and already publicly available) benchmark datasets using the partially automated annotation tools and workflows developed in WP5. This resulted in the creation of the MOTS-Challenge and KITTI-MOTS benchmark datasets (constructed from the existing MOT-Challenge and KITTI benchmark datasets).
The approaches from WP1 achieved top results in 6 VOS challenges in 2018 and 2019, including 1st places in the CVPR'18 and CVPR'19 DAVIS Challenge competitions and the ECCV'18 and ICCV'19 YouTube-VOS Challenges, showcasing the progress of our work beyond the previous state-of-the-art.

In WP2, SIAM-RCNN [CVPR'20] achieves top performance across a large range of single-object tracking benchmarks, and T2R-R2T [ICRA'20] remains among the top-performing tracking approaches on the KITTI Tracking benchmark. In addition, our proposed MOTS task formulation has since become widely adopted in the multi-object tracking community. HOTA has likewise been very positively received, and two of the major multi-object tracking benchmarks, MOT Challenge and KITTI-MOT, have since switched to HOTA as their primary means of evaluation.
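
A basic building block of MOTS-style tracking is linking pixel-accurate instance masks across frames by mask overlap. The following is only an illustrative sketch of IoU-based greedy association, not the matching procedure of any specific paper:

```python
import numpy as np

def mask_iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def greedy_associate(prev_masks, cur_masks, iou_thresh=0.5):
    """Greedily link current-frame masks to previous-frame tracks by IoU.

    Returns (prev_idx, cur_idx) pairs, taking the highest-IoU pairs first
    and using each mask at most once. Illustrative sketch only.
    """
    pairs = [(mask_iou(p, c), i, j)
             for i, p in enumerate(prev_masks)
             for j, c in enumerate(cur_masks)]
    pairs.sort(reverse=True, key=lambda t: t[0])
    matches, used_p, used_c = [], set(), set()
    for iou, i, j in pairs:
        if iou >= iou_thresh and i not in used_p and j not in used_c:
            matches.append((i, j))
            used_p.add(i)
            used_c.add(j)
    return matches

# Toy usage: two objects in frame 0, one re-detected (shifted) in frame 1.
f0 = [np.zeros((8, 8), bool), np.zeros((8, 8), bool)]
f0[0][0:4, 0:4] = True
f0[1][4:8, 4:8] = True
f1 = [np.zeros((8, 8), bool)]
f1[0][0:4, 1:5] = True  # same object as f0[0], shifted one pixel right
links = greedy_associate(f0, f1)
```

Working on masks rather than bounding boxes avoids the ambiguity of overlapping boxes, which is one motivation behind the MOTS formulation.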

In WP3, our MeTRo [FG'20] approach won 1st place in the ECCV'18 PoseTrack Challenge on 3D Human Pose Estimation. In addition, our MeTRAbs [IEEE TBIOM'21] approach recently won the ECCV'20 "3D Poses in the Wild" Challenge (https://virtualhumans.mpi-inf.mpg.de/3DPW_Challenge/).

In WP4, our DualConvMesh-Net combines geodesic and Euclidean convolutions over 3D geometric data in order to capture both the local mesh connectivity and 3D spatial proximity, reaching state-of-the-art performance on multiple benchmarks.
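
The two neighborhood notions differ: geodesic neighbors are reached by walking along mesh edges, while Euclidean neighbors are simply nearby in 3D space, even if they lie on a disconnected surface. A minimal sketch of the distinction (illustrative names, not the DualConvMesh-Net implementation):

```python
from collections import deque
import numpy as np

def geodesic_neighbors(edges, start, hops):
    """Vertices within `hops` edges of `start` along the mesh graph (BFS)."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, d = frontier.popleft()
        if d == hops:
            continue
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, d + 1))
    return seen

def euclidean_neighbors(verts, start, radius):
    """Vertices within a Euclidean ball around `start`, ignoring edges."""
    dists = np.linalg.norm(verts - verts[start], axis=1)
    return set(np.nonzero(dists <= radius)[0])

# Toy mesh: a chain of vertices 0-1-2-3 plus a nearby, unconnected vertex 4.
verts = np.array([[0, 0, 0], [1, 0, 0], [2, 0, 0], [3, 0, 0], [0.5, 0.1, 0]])
edges = [(0, 1), (1, 2), (2, 3)]
geo = geodesic_neighbors(edges, 0, hops=1)       # follows the surface
euc = euclidean_neighbors(verts, 0, radius=1.1)  # ignores connectivity
```

Here vertex 4 is a Euclidean neighbor of vertex 0 but not a geodesic one; combining both neighborhood types lets a network reason over the surface and over free-space proximity at the same time.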

In WP5, we have developed a robust and efficient generic object tracking pipeline, 4D-GVT [ICRA'20], that can be used for self-supervised object mining from large video collections. This progress enables a new paradigm in training object detectors and trackers, in particular for online learning scenarios.
MOTS (Multi-Object Tracking and Segmentation)
3D Tracking
Video Object Segmentation
3D Semantic Segmentation
3D Instance Segmentation
Generic Object Tracking
Video Mining and Object Discovery
3D Human Body Pose Estimation