Learning Generative 3D Scene Models for Training and Validating Intelligent Systems

Periodic Reporting for period 2 - LEGO-3D (Learning Generative 3D Scene Models for Training and Validating Intelligent Systems)

Reporting period: 2022-04-01 to 2023-09-30

Recently, the field of computer vision has witnessed a major transformation away from expert-designed shallow models towards more generic deep representation learning. However, collecting labeled data for training deep models is costly, and existing simulators with artist-designed scenes do not provide the required variety and fidelity. Project LEGO-3D will tackle this problem by developing probabilistic models capable of synthesizing 3D scenes jointly with photo-realistic 2D projections from arbitrary viewpoints and with full control over the scene elements. Our key insight is that data augmentation, while hard in 2D, becomes considerably easier in 3D because physical properties such as viewpoint invariances and occlusion relationships are captured by construction. Thus, our goal is to learn the entire 3D-to-2D simulation pipeline. In particular, we will focus on the following problems:

(A) We will devise algorithms for automatic decomposition of real and synthetic scenes into latent 3D primitive representations capturing geometry, material, light and motion.
(B) We will develop novel probabilistic generative models that can synthesize large-scale 3D environments based on the primitives extracted in project (A). In particular, we will develop unconditional, conditional and spatio-temporal scene generation networks.
(C) We will combine differentiable and neural rendering techniques with deep-learning-based image synthesis, yielding high-fidelity 2D renderings of the 3D representations generated in project (B) while capturing ambiguities and uncertainties.

Project LEGO-3D will significantly impact a large number of application areas. Examples include vision systems that require access to large amounts of annotated data, safety-critical applications such as autonomous cars that rely on efficient means of training and validation, and the entertainment industry, which seeks to automate the creation and manipulation of 3D content.

3D Scene Parsing and Annotation

We have developed novel, efficient 3D representations and devised algorithms for the decomposition of scenes. In particular, we built upon recently proposed implicit representations to capture geometry and appearance. For example, we demonstrated that significant speed-ups are possible by utilizing thousands of tiny MLPs instead of a single large one (KiloNeRF). Motivated by recent advances in monocular geometry prediction, we further systematically investigated how much these cues improve neural implicit surface reconstruction (MonoSDF). To benchmark 3D reconstruction, novel view synthesis and simulation approaches, we annotated KITTI-360, a suburban driving dataset that comprises rich input modalities and accurate localization. Given these annotations, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics.
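The idea of replacing one large network with many tiny ones can be illustrated with a minimal sketch. It assumes a regular grid over the unit cube and routes each query point to the small MLP that owns its cell. The class names `TinyMLP` and `GridOfTinyMLPs`, the layer sizes, and the random weights are all hypothetical stand-ins; this is not the KiloNeRF implementation, which involves trained networks and further optimizations.

```python
import random


class TinyMLP:
    """A deliberately small MLP: 3D point -> 4 values (e.g. colour and density)."""

    def __init__(self, rng, n_in=3, hidden=8, n_out=4):
        # Random weights stand in for trained parameters (assumption).
        self.w1 = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(hidden)]
        self.w2 = [[rng.uniform(-1.0, 1.0) for _ in range(hidden)] for _ in range(n_out)]

    def __call__(self, point):
        # One hidden layer with ReLU, no biases, to keep the sketch short.
        h = [max(0.0, sum(w * x for w, x in zip(row, point))) for row in self.w1]
        return [sum(w * a for w, a in zip(row, h)) for row in self.w2]


class GridOfTinyMLPs:
    """Splits the unit cube into resolution^3 cells, one TinyMLP per cell."""

    def __init__(self, resolution=4, seed=0):
        rng = random.Random(seed)
        self.resolution = resolution
        self.mlps = {
            (i, j, k): TinyMLP(rng)
            for i in range(resolution)
            for j in range(resolution)
            for k in range(resolution)
        }

    def cell_index(self, point):
        # Clamp so that coordinates on the boundary (e.g. 1.0) map to a valid cell.
        r = self.resolution
        return tuple(min(r - 1, max(0, int(c * r))) for c in point)

    def query(self, point):
        # Only the small network responsible for this cell is evaluated.
        return self.mlps[self.cell_index(point)](point)


grid = GridOfTinyMLPs(resolution=4, seed=0)
point = (0.1, 0.6, 0.9)
print(grid.cell_index(point), grid.query(point))
```

With `resolution=16` the same sketch would hold 4096 networks. Each query still touches only one of them, which is where the potential speed-up over a single large MLP comes from.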

Probabilistic Generative 3D Models

Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train. Motivated by the finding that the discriminator cannot fully exploit features from deeper layers of a pretrained feature network, we proposed a more effective strategy that mixes features across channels and resolutions (Projected GAN and StyleGAN-XL), enabling training from little data and on diverse datasets. One key hypothesis of this ERC StG is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. We demonstrated the validity of this hypothesis by representing scenes as compositional generative neural feature fields. This representation allows for disentangling one or multiple objects from the background, as well as individual objects' shapes and appearances, while learning from unstructured and unposed image collections without any additional supervision (GIRAFFE, CAMPARI, VoxGRAF).
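To make the notion of a compositional feature field more concrete, the following minimal sketch combines the outputs of several per-object fields at a single 3D point. It assumes density-weighted averaging: densities are summed, and features are averaged with the densities as weights. The function name `compose_fields` and the toy values are illustrative and not taken from the project code.

```python
def compose_fields(densities, features):
    """Combine per-object (density, feature) pairs evaluated at one 3D point.

    densities: list of non-negative floats, one per object (or background).
    features:  list of equally long feature vectors, one per object.
    Returns the summed density and the density-weighted mean feature.
    """
    total = sum(densities)
    dim = len(features[0])
    if total <= 0.0:
        # Empty space: no object contributes at this point.
        return 0.0, [0.0] * dim
    combined = [
        sum(d * f[c] for d, f in zip(densities, features)) / total
        for c in range(dim)
    ]
    return total, combined


# Two hypothetical fields: an object and the background.
density, feature = compose_fields([1.0, 3.0], [[1.0, 0.0], [0.0, 1.0]])
print(density, feature)  # prints 4.0 [0.25, 0.75]
```

Because each object keeps its own field, removing or moving one object only changes its own entry in the lists. This is the sense in which such a representation keeps objects and background disentangled.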

In the second half of the ERC StG funding period, we plan to combine the results of our unsupervised compositional models (GIRAFFE) with weak supervision provided by manual annotations (KITTI-360) as well as inferred annotations (SAM, Omnidata, Objects) to yield high-resolution, photo-realistic, compositional 3D-aware representations. Moreover, we plan to address inpainting as well as editing of light, texture and materials. We will also extend our current model from static to dynamic scenes. In terms of evaluation, we will develop adversarial methods that perturb the simulation to yield out-of-distribution training data for downstream tasks (segmentation, detection, sensorimotor control).

GIRAFFE: Representing Scenes as Compositional Generative Neural Feature Fields (CVPR 2021 Best Paper)
