Periodic Reporting for period 4 - DeeViSe (Deep Learning for Dynamic 3D Visual Scene Understanding)
Reporting period: 2022-10-01 to 2023-03-31
The goal of the DeeViSe project has been to develop novel end-to-end deep learning approaches for dynamic visual scene understanding that break the boundaries between different vision tasks, in order to create vision systems that combine multiple vision modalities within a common deep learning framework. In addition, DeeViSe aims to endow deep learning approaches with a notion of what it means to move through a 3D world by developing neural network architectures that can perform 3D semantic scene understanding. Finally, DeeViSe addresses the question of how deep neural networks can be trained successfully with reduced (external) supervision by investigating weakly supervised and self-supervised learning mechanisms, as well as more scalable workflows for human-assisted training data generation.
In order to study these research questions, the DeeViSe project investigated four core computer vision tasks that are essential for scene understanding: Pixel-level Semantic Scene Analysis, combining object recognition and segmentation (WP1); Deep Object Tracking (WP2); Dynamic Human Pose Analysis (WP3); and Deep Representations for 3D Scene Understanding (WP4). These tasks are supported by activities on Learning with Reduced Supervision (WP5) and Dataset Curation and Annotation (WP6).
WP2 aimed at building end-to-end deep learning approaches for single- and multi-object tracking. We made good progress towards this goal and developed state-of-the-art approaches for both tasks (SIAM-RCNN [CVPR'20], T2R-R2T [ICRA'20]). Moreover, we defined the task of "Multi-Object Tracking and Segmentation" (MOTS) [CVPR'19] and established it in the computer vision community, together with an annotated benchmark dataset and evaluation methodology (MOTS-Challenge). As previous tracking evaluation metrics suffered from systematic problems, we developed a novel tracking evaluation methodology, Higher-Order Tracking Accuracy (HOTA) [IJCV'20], which has since become a standard for tracking evaluation.
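For readers unfamiliar with HOTA, its core definition can be sketched as follows. This is a simplified recollection of the formulation in the IJCV paper rather than a verbatim reproduction; the notation (TP, FN, FP, TPA, FNA, FPA) follows that paper, and the published definition should be consulted for the exact details.

```latex
% Sketch of the HOTA metric (simplified recollection of the IJCV formulation).
% For a given localization threshold \alpha, detections are split into true
% positives (TP), false negatives (FN), and false positives (FP); for each
% true positive c, the association sets TPA(c), FNA(c), FPA(c) measure how
% consistently its identity is preserved over time.
\[
  \mathrm{A}(c) = \frac{|\mathrm{TPA}(c)|}{|\mathrm{TPA}(c)| + |\mathrm{FNA}(c)| + |\mathrm{FPA}(c)|},
  \qquad
  \mathrm{HOTA}_{\alpha} = \sqrt{\frac{\sum_{c \in \mathrm{TP}} \mathrm{A}(c)}{|\mathrm{TP}| + |\mathrm{FN}| + |\mathrm{FP}|}}
\]
% The final score averages HOTA_alpha over a range of localization thresholds,
% so that detection, association, and localization quality all contribute.
\[
  \mathrm{HOTA} = \frac{1}{|\mathcal{A}|} \sum_{\alpha \in \mathcal{A}} \mathrm{HOTA}_{\alpha},
  \qquad \mathcal{A} = \{0.05, 0.10, \dots, 0.95\}.
\]
```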
WP3 focused on developing deep learning approaches for human motion analysis. The main outcomes were novel approaches for 3D body pose estimation (MeTrO [FG'20], MeTrAbs [IEEE TBIOM'21]) that won competitions at ECCV'18 and ECCV'20. In addition, we developed a principled approach for multi-dataset training of body pose estimation models [WACV'23] that resulted in a new state of the art in 3D human pose estimation quality.
The goal of WP4 was to interface the geometry-centric representations from traditional Computer Vision methods with the semantic analysis capabilities of deep learning based vision modules. For this, we worked on developing deep learning representations for 3D point clouds and 3D geometry, resulting in novel approaches for 3D semantic segmentation (DualConvMesh-Net [CVPR'20]), 3D instance segmentation (3D-BEVIS [GCPR'19], 3D-MPA [CVPR'20]), 3D data augmentation (Mix3D [3DV'21]), and representation learning (Point2Vec [GCPR'23]). Finally, we developed a novel mask transformer based architecture for 3D semantic segmentation tasks (Mask3D [ICRA'23]).
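To give a flavour of the data-augmentation idea behind Mix3D, the sketch below shows the basic out-of-context mixing step as we would summarize it: two independently augmented point-cloud scenes are merged into a single training sample, so that objects appear outside their usual context. The function and variable names are illustrative only and do not correspond to the released implementation.

```python
import numpy as np

def mix3d_sample(points_a, labels_a, points_b, labels_b):
    """Illustrative sketch of Mix3D-style out-of-context augmentation.

    points_*: (N, 3) float arrays of 3D coordinates
    labels_*: (N,) integer arrays of per-point semantic labels
    """
    # Center each scene so the two point clouds overlap in space.
    points_a = points_a - points_a.mean(axis=0, keepdims=True)
    points_b = points_b - points_b.mean(axis=0, keepdims=True)
    # Concatenate points and labels; the network is trained on the union.
    mixed_points = np.concatenate([points_a, points_b], axis=0)
    mixed_labels = np.concatenate([labels_a, labels_b], axis=0)
    return mixed_points, mixed_labels
```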
WP5 addressed the problem of training deep networks with reduced human supervision. Towards this goal, we developed 1) automated tools for interactive, human-assisted image segmentation (ITIS [BMVC'18], DynaMITe [ICCV'23]); 2) an automated workflow for human-assisted segmentation annotation of entire video datasets [WACV'21]; 3) category-agnostic multi-object tracking approaches (4D-GVT [ICRA'20], OWT [CVPR'22]) that can be used for automatic mining of candidate object tracks in large video collections [ICRA'19]. These tools significantly reduce the manual annotation effort in the creation of large-scale training and evaluation datasets.
WP6 was originally intended to focus on dataset collection and annotation. Due to ethical and data protection concerns during the Ethical Review phase, the plan to record and collect new datasets was not pursued further. Instead, we focused our efforts on creating novel and detailed segmentation annotations for existing (and already publicly available) benchmark datasets using the partially automated annotation tools and workflows developed in WP5. This resulted in the creation of the MOTS-Challenge [CVPR'19], the TAO-OWT [CVPR'22 oral], and the BURST [WACV'23] benchmark datasets.
Our work on pixel-level dynamic scene analysis (WP1) was highly influential in advancing research on Video Object Segmentation (VOS) through a series of highly visible papers and successes in VOS competitions. Our work has also been instrumental in connecting the previously separate areas of object detection, tracking, and segmentation, with multiple WP2 results (MOTS [CVPR'19], HOTA [IJCV'20], Open-World Tracking [CVPR'22]) contributing significantly to this development. Focusing on the visual analysis of human motion as one particularly important application area (WP3), we have developed a novel 3D human body pose estimation approach, MeTrAbs [IEEE TBIOM'21], which reaches world-leading 3D body pose estimation accuracy.
Working towards our goal of endowing deep learning based computer vision systems with better 3D scene understanding capabilities, we have developed in WP4 a series of state-of-the-art approaches for 3D semantic segmentation, including Mask3D [ICRA'23], which is on track to become a standard architecture for 3D scene understanding. In particular, our finding that the same mask-transformer-based architecture can be used to solve a large variety of 2D and 3D semantic scene analysis tasks (as exemplified in TarViS [CVPR'23], DynaMITe [ICCV'23], and Mask3D [ICRA'23]) holds significant promise for our envisioned unification of those different capabilities into a common deep learning framework.
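The unification idea can be illustrated schematically: a shared transformer decoder refines a set of learned queries against scene features, and each query is decoded into a mask plus a task-dependent prediction (a semantic class, an instance identity, or a tracked object). The sketch below is a simplified illustration of this general pattern, not the actual TarViS, DynaMITe, or Mask3D implementation; all class and parameter names are hypothetical.

```python
import torch
import torch.nn as nn

class MaskTransformerSketch(nn.Module):
    """Simplified illustration of a shared mask-transformer decoder:
    learned queries attend to scene features (2D pixels or 3D points),
    and each query predicts one mask plus one class label."""

    def __init__(self, num_queries=100, dim=256, num_classes=20, num_layers=6):
        super().__init__()
        self.queries = nn.Embedding(num_queries, dim)            # learned object/task queries
        decoder_layer = nn.TransformerDecoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        self.class_head = nn.Linear(dim, num_classes + 1)         # +1 for a "no object" class

    def forward(self, scene_features):
        # scene_features: (B, N, dim), where N is the number of pixels, points, or voxels.
        B = scene_features.shape[0]
        q = self.queries.weight.unsqueeze(0).expand(B, -1, -1)    # (B, Q, dim)
        q = self.decoder(q, scene_features)                       # queries attend to the scene
        class_logits = self.class_head(q)                         # (B, Q, C+1)
        # Each query's mask is its similarity to every scene element.
        mask_logits = torch.einsum("bqd,bnd->bqn", q, scene_features)
        return class_logits, mask_logits
```

The same decoder pattern applies whether the scene features come from a 2D image backbone or a sparse 3D backbone; only the feature extractor and the interpretation of the queries change, which is what makes this family of architectures attractive for unifying 2D and 3D scene analysis tasks.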
Finally, our work in WP5 has also resulted in the development of methods and workflows that can significantly reduce the manual annotation effort for training large vision systems.