
Rich, Structured Models for Scene Recovery, Understanding and Interaction

Periodic Reporting for period 4 - RSM (Rich, Structured Models for Scene Recovery, Understanding and Interaction)

Reporting period: 2020-09-01 to 2021-09-30

Artificial Intelligence, and in particular Computer Vision, has gained considerable momentum in recent years, both in industry and academia. There is a sense that the time is ripe to realize grand goals and to bring Computer Vision from the lab into real life. This transition has a huge impact on society, enabling autonomously driving cars, new ways of communication and digital health, to mention a few. One example of such a transition is the recent announcement by Meta (formerly Facebook) that the internet will transform into a “Metaverse”, where many user interactions and much commerce will take place.
But is a computer vision system already as good as a human? The answer is: “Unfortunately, not quite yet.” Given a single image, a child can effortlessly describe the objects and their relationships in much more detail than any computer can. Humans can also quite effortlessly “visually extract” an object from its background, even in the presence of fine details such as hair. Computers cannot yet achieve this fully automatically. Yet for many real-world applications, such as Virtual and Augmented Reality, it is ultimately a necessity to reach such levels of accuracy and richness of output.
The objective of this project is to get a step closer to this overarching goal. There are a few key aspects needed to make a significant step forward. We bundle these aspects under the term “Rich Scene Model (RSM)”. One aspect of this richness is to build a so-called joint model, in which semantic properties, physical properties and prior knowledge of a task are combined in new ways. Other aspects of this richness are to build models that are robust to input noise, or models that can predict their own level of quality.
In the following, I highlight the different aspects that we found to improve the quality and robustness of computer vision models.
A first aspect we considered is how to make a computer vision model more robust to input-data noise. We examined ways to combine deep neural networks with sampling-based techniques, in the context of camera localization in 3D reconstructions. A typical robust estimation method for this task is the well-known RANSAC algorithm from 1981. We achieved a fusion of RANSAC with a deep neural network by making hypothesis selection probabilistic and minimizing the expected loss of the resulting pipeline [1]. We conducted this initial work in 2017, and it led to a line of work in which we pushed the state of the art for camera localization over several years.
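To make the core idea concrete, here is a minimal, illustrative sketch of the DSAC-style objective: hypothesis selection is turned into a softmax distribution over learned scores, and the expected task loss, rather than the loss of a hard argmax choice, is minimized. All names and the toy data are hypothetical; the actual system in [1] scores camera-pose hypotheses inside a full localization pipeline.

```python
import torch

def expected_loss(scores, losses):
    """DSAC-style objective: treat hypothesis selection as sampling
    from a softmax over learned scores, and minimize the expected
    task loss instead of the loss of a hard (non-differentiable)
    argmax selection.

    scores: (N,) differentiable scores, one per sampled hypothesis
    losses: (N,) task loss of each hypothesis (e.g. pose error)
    """
    probs = torch.softmax(scores, dim=0)   # selection distribution
    return (probs * losses).sum()          # differentiable expectation

# Toy usage: 8 hypotheses, scores from a (hypothetical) scoring net.
scores = torch.randn(8, requires_grad=True)
losses = torch.rand(8)                     # pose errors, fixed here
expected_loss(scores, losses).backward()   # gradients reach the scorer
print(scores.grad)
```

Because the expectation is differentiable with respect to the scores, gradients can flow back into the scoring network, which is what makes a RANSAC-style pipeline trainable end to end.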
In the context of “model robustness”, we joined forces with other groups from MPI Tübingen, TU Munich and ETH Zürich to quantify the robustness of computer vision systems across applications. We launched the Robust Vision Challenge, which was first held at CVPR 2018 and is still ongoing (www.robustvision.net).
A second aspect we considered are new ways of generating training data. Good and diverse training data is one of the keys to building high-quality and robust models. In [2] we examined a new way of generating training data by taking existing footage and augmenting it with virtual models. As an example, we used a pool of synthetic car models and placed them in real traffic scenes. By doing so, we were able to improve the state of the art for instance segmentation by about 8%.
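The compositing step at the heart of this kind of augmentation can be sketched as simple alpha blending. A real pipeline such as [2] additionally needs plausible object placement, consistent lighting and shadows to make the augmented images useful as training data; the function and data below are illustrative only.

```python
import numpy as np

def composite(background, render, alpha):
    """Paste a rendered object into a real photograph via alpha blending.

    background: (H, W, 3) real image, float values in [0, 1]
    render:     (H, W, 3) rendered object (e.g. a synthetic car)
    alpha:      (H, W, 1) object mask from the renderer, 1 = object
    """
    return alpha * render + (1.0 - alpha) * background

# Toy usage with random arrays standing in for a street photo and a
# rendered car model.
h, w = 256, 512
photo = np.random.rand(h, w, 3)
car = np.random.rand(h, w, 3)
mask = np.zeros((h, w, 1))
mask[100:200, 200:350] = 1.0               # hypothetical car region
augmented = composite(photo, car, mask)    # new training image
```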
A third, and major, aspect of the grant was the impact of synergy effects in computer vision. An example is to jointly segment an image into its semantic classes (e.g. car, road, and trees) and, at the same time, estimate the depth of every single pixel. The outcome of our study of this “synergy aspect” is that the devil is in the detail. Our work on combining object instance recognition and scene flow [3] is an example of a rich model that combines physical and semantic information. We explored various levels of integration and concluded that a mid-level integration, in the form of bounding box detection, works best overall. By doing so, we were able to achieve state-of-the-art results for scene flow estimation. In the more recent work [4], we have seen the advantage of jointly training a model that decomposes and composes (renders) an image given its 3D shape and material appearance.
Another example of a rich scene model is our work on cell tracking in large 3D volumes [5]. The idea is to have a large pool of cell-segmentation candidates and then solve cell tracking as a cell-by-cell assignment problem. We found that this formulation exploited the synergy effect best: a more flexible model formulation, in which tracking and segmentation are defined at the pixel level (in contrast to a pre-computed pool of segmentations), performed worse in our experiments. In this context, we also considered the more theoretical question of finding an optimal, diverse segmentation pool. This problem can be defined as finding the M-best-diverse solutions of a structured energy function; for a certain class of energy functions, it can actually be solved with global optimality [6].
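As a toy illustration of the tracking-by-assignment idea, the sketch below matches cell detections between two consecutive frames by solving a linear assignment problem over centroid distances. This is a drastic simplification: [5] jointly optimizes over whole candidate pools and many frames with a dedicated primal-dual solver, whereas here the off-the-shelf Hungarian algorithm from SciPy is used on a single frame pair.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_cells(cells_t, cells_t1):
    """Frame-to-frame cell tracking as a linear assignment problem.

    cells_t, cells_t1: (N, 3) and (M, 3) arrays of cell centroids in
    two consecutive frames. Cost = Euclidean distance; the Hungarian
    algorithm finds the globally cheapest one-to-one matching.
    """
    cost = np.linalg.norm(cells_t[:, None, :] - cells_t1[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows, cols))

# Toy usage: 4 cells drifting slightly between frames.
rng = np.random.default_rng(0)
frame_a = rng.uniform(0, 100, size=(4, 3))
frame_b = frame_a + rng.normal(0, 1.0, size=(4, 3))
print(match_cells(frame_a, frame_b))   # recovers the identity matching
```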
Towards the end of the ERC grant, we considered the aspect of a model predicting its own level of quality. The ability of a model to predict its own quality is essential when embedding it into a bigger system. We were interested in building a generative classifier, meaning a classifier that not only predicts the optimal class label, but also the probability of this classification. We achieved state-of-the-art results in out-of-distribution detection by utilizing so-called invertible neural networks [7,8].
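The following sketch illustrates the generative-classifier principle with simple Gaussian densities standing in for the invertible networks of [7,8]: each class gets its own density model, classification picks the class with the highest likelihood, and an input that no class explains well is flagged as out-of-distribution. The class names, threshold and data are all illustrative.

```python
import numpy as np
from scipy.stats import multivariate_normal

class GenerativeClassifier:
    """Toy generative classifier: one density model per class.
    A unit-covariance Gaussian stands in for the invertible network
    of [7,8], which would supply exact log p(x | y) for images.
    """
    def __init__(self, means):
        self.models = [multivariate_normal(mean=m) for m in means]

    def log_likelihoods(self, x):
        return np.array([m.logpdf(x) for m in self.models])

    def predict(self, x, ood_threshold=-10.0):
        ll = self.log_likelihoods(x)
        if ll.max() < ood_threshold:    # no class explains x well
            return "out-of-distribution"
        return int(ll.argmax())         # most likely class

clf = GenerativeClassifier(means=[np.zeros(2), np.full(2, 5.0)])
print(clf.predict(np.array([0.1, -0.2])))    # -> class 0
print(clf.predict(np.array([50.0, 50.0])))   # -> "out-of-distribution"
```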
Please find a complete list of publications associated with the ERC grant on our webpage.

[1] E. Brachmann, A. Krull, S. Nowozin, J. Shotton, F. Michel, S. Gumhold, C. Rother, “DSAC – Differentiable RANSAC for Camera Localization”, CVPR 2017 (oral).
(Nominated for the best student paper award at CVPR 2017, one of the primary computer vision conferences, with around 3000 submissions.)
[2] H. Abu Alhaija, S.K. Mustikovela, L. Mescheder, A. Geiger, C. Rother, “Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes”, IJCV 2018.
[3] A. Behl, O. Hosseini Jafari, S. K. Mustikovela, H. Abu Alhaija, C. Rother, A. Geiger, “Bounding Boxes, Segmentations and Object Coordinates: How Important is Recognition for 3D Scene Flow Estimation in Autonomous Driving Scenarios?”, ICCV 2017.
[4] H. Abu Alhaija, S.K. Mustikovela, J. Thies, V. Jampani, M. Nießner, A. Geiger, C. Rother, “Intrinsic Autoencoders for Joint Deferred Neural Rendering and Intrinsic Image Decomposition”, 3DV 2020.
[5] S. Haller, M. Prakash, L. Hutschenreiter, T. Pietzsch, C. Rother, F. Jug, P. Swoboda, B. Savchynskyy, “A Primal-Dual Solver for Large-Scale Tracking-by-Assignment”, AISTATS 2020.
[6] A. Kirillov, A. Shekhovtsov, C. Rother, B. Savchynskyy, “Joint M-Best-Diverse Labelings as a Parametric Submodular Minimization”, NIPS 2016.
[7] R. Mackowiak, L. Ardizzone, U. Köthe, C. Rother, “Generative Classifiers as a Basis for Trustworthy Image Classification”, CVPR 2021 (oral).
[8] L. Ardizzone, R. Mackowiak, C. Rother, U. Köthe, “Training Normalizing Flows with the Information Bottleneck for Competitive Generative Classification”, NeurIPS 2020 (oral).