Periodic Reporting for period 2 - LEGO-3D (Learning Generative 3D Scene Models for Training and Validating Intelligent Systems)
Reporting period: 2022-04-01 to 2023-09-30
(A) We will devise algorithms for automatic decomposition of real and synthetic scenes into latent 3D primitive representations capturing geometry, material, light and motion.
(B) We will develop novel probabilistic generative models that can synthesize large-scale 3D environments based on the primitives extracted in project (A). In particular, we will develop unconditional, conditional and spatio-temporal scene generation networks.
(C) We will combine differentiable and neural rendering techniques with deep-learning-based image synthesis, yielding high-fidelity 2D renderings of the 3D representations generated in project (B) while capturing ambiguities and uncertainties.
Project LEGO-3D will significantly impact a large number of application areas. Examples include vision systems that require access to large amounts of annotated data, safety-critical applications such as autonomous cars that rely on efficient means of training and validation, and the entertainment industry, which seeks to automate the creation and manipulation of 3D content.
We have developed novel, efficient 3D representations and devised algorithms for decomposing scenes. In particular, we built upon recently proposed implicit representations to capture geometry and appearance. For example, we demonstrated that significant rendering speed-ups are possible by utilizing thousands of tiny MLPs instead of a single large one (KiloNeRF). Motivated by recent advances in monocular geometry prediction, we further systematically investigated the utility that such monocular cues provide for improving neural implicit surface reconstruction (MonoSDF). To benchmark 3D reconstruction, novel view synthesis and simulation approaches, we annotated KITTI-360, a suburban driving dataset that comprises rich input modalities and accurate localization. Given these annotations, we established benchmarks and baselines for several tasks relevant to mobile perception, encompassing problems from computer vision, graphics, and robotics.
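To illustrate the decomposition idea behind KiloNeRF, the following PyTorch sketch splits the scene volume into a coarse grid and routes every query point to a small per-cell MLP. The grid resolution, network widths, naming (TinyNeRFGrid, cell_index) and the naive routing loop are illustrative assumptions rather than the published configuration, which additionally relies on positional encoding, distillation from a regular NeRF and fused CUDA kernels.

```python
import torch
import torch.nn as nn


class TinyNeRFGrid(nn.Module):
    """Toy KiloNeRF-style model: the unit cube is split into a coarse voxel
    grid and each cell is handled by its own tiny MLP mapping a 3D point to
    (density, RGB)."""

    def __init__(self, grid_res: int = 4, hidden: int = 32):
        super().__init__()
        self.grid_res = grid_res
        # One small MLP per grid cell instead of a single large network.
        self.mlps = nn.ModuleList(
            nn.Sequential(
                nn.Linear(3, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 4),  # 1 density value + 3 colour channels
            )
            for _ in range(grid_res ** 3)
        )

    def cell_index(self, xyz: torch.Tensor) -> torch.Tensor:
        # Map points in [0, 1)^3 to the flat index of their grid cell.
        ijk = (xyz.clamp(0.0, 1.0 - 1e-6) * self.grid_res).long()
        return (ijk[:, 0] * self.grid_res + ijk[:, 1]) * self.grid_res + ijk[:, 2]

    def forward(self, xyz: torch.Tensor):
        out = xyz.new_zeros(xyz.shape[0], 4)
        idx = self.cell_index(xyz)
        # Route each point to the tiny MLP owning its cell. A real implementation
        # batches this per cell (or fuses it in a CUDA kernel) for speed.
        for cell in idx.unique().tolist():
            mask = idx == cell
            out[mask] = self.mlps[cell](xyz[mask])
        return out[:, :1].relu(), out[:, 1:].sigmoid()  # density, RGB


model = TinyNeRFGrid()
sigma, rgb = model(torch.rand(1024, 3))  # query 1024 random points
print(sigma.shape, rgb.shape)            # -> (1024, 1) and (1024, 3)
```

Because every tiny network only has to represent a small part of the scene, each query touches far fewer parameters than a single large MLP would, which is the source of the speed-up noted above.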
Probabilistic Generative 3D Models
Generative Adversarial Networks (GANs) produce high-quality images but are challenging to train. Motivated by the finding that a discriminator operating on the features of a pretrained image encoder cannot fully exploit the features from its deeper layers, we proposed a more effective strategy that mixes features across channels and resolutions (Projected GAN and StyleGAN-XL), enabling training with little data and on diverse datasets.

One key hypothesis of this ERC StG is that incorporating a compositional 3D scene representation into the generative model leads to more controllable image synthesis. We demonstrated the validity of this hypothesis by representing scenes as compositional generative neural feature fields (GIRAFFE, CAMPARI, VoxGRAF). This representation disentangles one or multiple objects from the background, as well as individual objects' shapes and appearances, while learning from unstructured and unposed image collections without any additional supervision.
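As a concrete illustration of how several generative feature fields can be combined into a single scene representation, the sketch below implements a density-weighted composition operator in the spirit of GIRAFFE. The function name, tensor shapes and feature dimensionality are assumptions made for illustration; the full model additionally volume-renders the composed feature field at low resolution and upsamples it with a 2D neural rendering network.

```python
import torch


def compose_feature_fields(sigmas: torch.Tensor, feats: torch.Tensor):
    """Density-weighted composition of several object/background feature fields
    evaluated at the same 3D points (GIRAFFE-style composition operator).

    sigmas: (n_entities, n_points, 1) non-negative volume densities
    feats:  (n_entities, n_points, c) feature vectors
    Returns the combined density and feature for every point."""
    sigma = sigmas.sum(dim=0)                                   # total density
    feat = (sigmas * feats).sum(dim=0) / sigma.clamp(min=1e-8)  # density-weighted mean
    return sigma, feat


# Toy example: two objects plus a background field, each evaluated by its own
# generator MLP (omitted here) at 4096 sample points along the camera rays.
sigmas = torch.rand(3, 4096, 1)
feats = torch.randn(3, 4096, 32)
sigma, feat = compose_feature_fields(sigmas, feats)
print(sigma.shape, feat.shape)  # -> (4096, 1) and (4096, 32)
```

Because each entity contributes its own density and feature field before composition, changing the latent code or pose of a single entity affects only that object in the rendered image, which is what makes the synthesis controllable.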