SIMULACRON: From Camera Observations to Physical Simulations of the 3D World

Periodic Reporting for period 1 - SIMULACRON (SIMULACRON: From Camera Observations to Physical Simulations of the 3D World)

Reporting period: 2021-01-01 to 2022-06-30

In our research, we teach machines to understand the world around them from video observations. More specifically, in our project "understanding" means generating a physical simulation of the observed world from video observations - one that allows the machine to re-synthesize the video observation and, ideally, extrapolate the observed action - for example, a bouncing ball - into the future. While computer vision has for decades focused on reconstructing the surfaces of things from camera images or videos, we thus go a significant step further and generate an entire physical simulation of the observed phenomenon, thereby creating a more holistic understanding.
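To make the bouncing-ball example concrete, the following toy sketch shows the kind of forward simulation the project aims to recover from video: a ball falling under gravity and rebounding with energy loss. All parameters (gravity, restitution coefficient, timestep) are illustrative choices, not values from the project.

```python
# Hypothetical sketch of a forward physical simulation: a 1D ball height
# integrated with explicit Euler steps, bouncing off the ground.

def simulate_bounce(y0, v0, g=9.81, restitution=0.8, dt=0.01, steps=500):
    """Integrate ball height over time; reflect and damp velocity at ground."""
    y, v = y0, v0
    trajectory = []
    for _ in range(steps):
        v -= g * dt          # gravity accelerates the ball downward
        y += v * dt          # explicit Euler position update
        if y < 0.0:          # ground contact: reflect with energy loss
            y = 0.0
            v = -restitution * v
        trajectory.append(y)
    return trajectory

traj = simulate_bounce(y0=1.0, v0=0.0)
```

Given such a simulator, extrapolating an observed action into the future amounts to continuing the integration beyond the last observed frame.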

I believe that this is a central step in our effort to reproduce the human capacity to understand the world. Through evolution, humans have developed a powerful capacity to understand and even predict what happens around them. This is most apparent in sports, where - for example - a soccer player will jump up at the right time and location to head an oncoming ball into the corner of the goal, showing that she can predict with extreme precision where and when the ball will arrive and how exactly it will bounce off her head. For humans, this ability to understand an action at a level that allows us to precisely predict the evolution of things is vital for survival (allowing us to hunt or to evade imminent danger). For machines to properly interact and coexist with humans, I believe that they too need to develop this capacity.

It is a huge challenge that requires the development of novel algorithms for reconstructing 3D shape from cameras and videos, for analyzing and simulating 3D shapes, and for inferring physical simulations of deformable shapes from video.
We have tackled the above challenge from multiple angles:

1. We developed novel algorithms for reconstructing the 3D world from multiple images. In [Demmel et al. CVPR 2021], we presented a novel numerical solution to the classical problem of Bundle Adjustment, which aims to reconstruct the 3D world and the camera locations from a multitude of images. The proposed algorithm is numerically more stable than the state of the art and significantly faster (often up to 40% faster) than the fastest competing method.
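For readers unfamiliar with Bundle Adjustment, the following toy sketch illustrates the underlying optimization problem (not the paper's square-root solver): camera positions and 3D points are jointly refined so that the points reproject onto their observed pixel locations. The pinhole model, translation-only cameras, and all numeric values are illustrative simplifications.

```python
import numpy as np
from scipy.optimize import least_squares

def project(point, cam_t, f=100.0):
    """Pinhole projection of a 3D point seen from a camera at translation cam_t."""
    p = point - cam_t
    return f * p[:2] / p[2]

def residuals(params, observations, n_cams, n_pts):
    """Stacked reprojection errors over all (camera, point, pixel) observations."""
    cam_ts = params[:n_cams * 3].reshape(n_cams, 3)
    points = params[n_cams * 3:].reshape(n_pts, 3)
    res = []
    for cam_idx, pt_idx, uv in observations:
        res.extend(project(points[pt_idx], cam_ts[cam_idx]) - uv)
    return np.array(res)

# Ground truth: two cameras, four points; observations are exact projections.
true_cams = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
true_pts = np.array([[0.0, 0.0, 5.0], [1.0, 1.0, 6.0],
                     [-1.0, 0.5, 4.0], [0.5, -1.0, 7.0]])
obs = [(c, p, project(true_pts[p], true_cams[c]))
       for c in range(2) for p in range(4)]

# Perturb the unknowns and re-optimize to near-zero reprojection error.
x0 = np.concatenate([true_cams.ravel(), true_pts.ravel()])
x0 = x0 + 0.05 * np.random.default_rng(0).standard_normal(x0.shape)
sol = least_squares(residuals, x0, args=(obs, 2, 4))
```

Real Bundle Adjustment problems involve millions of points and thousands of cameras, which is why the numerical stability and speed of the solver matter.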

2. We developed a neural network approach called MonoRec [Wimbauer et al. CVPR 2021] that allows us to generate a dense reconstruction of a large-scale world from a single drive-through with a single camera. The resulting model is an almost photorealistic copy of the world, something often called a digital twin. It can serve as the basis for augmented reality applications or for testing self-driving cars in a simulated world that is a fairly exact copy of the real world.
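As a minimal illustration of how a dense reconstruction arises from per-frame depth (not MonoRec itself): once a method predicts a dense depth map for a camera frame, a point cloud for the digital twin can be obtained by back-projecting every pixel through the camera intrinsics. The intrinsic values below are toy numbers.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Turn an HxW depth map into an (H*W, 3) point cloud in camera coordinates."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - cx) * depth / fx     # pinhole model: X = (u - cx) * Z / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1).reshape(-1, 3)

depth = np.full((4, 4), 2.0)      # toy depth map: flat wall at 2 m
cloud = backproject(depth, fx=50.0, fy=50.0, cx=2.0, cy=2.0)
```

Fusing such per-frame clouds across an entire drive-through yields the large-scale model described above.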

3. We developed a neural network approach called NeuroMorph [Eisenberger et al. CVPR 2021] that allows us to compute the exact pointwise correspondence between two given 3D shapes along with a family of interpolating 3D shapes. We demonstrate that it can be used for digital puppeteering - i.e. transferring the dynamics of an observed 3D shape onto another 3D shape.
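The two outputs such a method produces can be sketched in a few lines (this is a toy stand-in, not NeuroMorph's network): a pointwise correspondence between two vertex sets, here via nearest neighbors in a feature space, and interpolated shapes obtained by blending corresponding vertices.

```python
import numpy as np

def correspondence(feat_x, feat_y):
    """For each vertex of X, the index of the closest vertex of Y in feature space."""
    d = np.linalg.norm(feat_x[:, None, :] - feat_y[None, :, :], axis=-1)
    return d.argmin(axis=1)

def interpolate(verts_x, verts_y, corr, t):
    """Linear blend from shape X toward its corresponding points on Y."""
    return (1.0 - t) * verts_x + t * verts_y[corr]

verts_x = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
verts_y = np.array([[1.2, 0.0, 0.0], [0.1, 0.0, 0.0]])
corr = correspondence(verts_x, verts_y)   # toy choice: positions as features
mid = interpolate(verts_x, verts_y, corr, t=0.5)
```

Sweeping t from 0 to 1 yields the family of interpolating shapes; applying the correspondence frame by frame gives the puppeteering transfer.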

4. We developed a method called i3DMM [Yenamandra et al. CVPR 2021]. It makes use of deep networks to synthesize 3D head models, including aspects such as hairstyle. This extends the classical deformable shape approaches to the age of deep learning.
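For context, the classical linear morphable-model idea that i3DMM extends with deep networks can be sketched as follows: a head shape is the mean shape plus a weighted sum of learned basis deformations. All shapes and bases here are random toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n_vertices = 100
mean_shape = rng.standard_normal((n_vertices, 3))   # average head (toy data)
basis = rng.standard_normal((5, n_vertices, 3))     # 5 deformation modes (toy data)

def synthesize(coeffs):
    """New head = mean + coefficient-weighted combination of basis modes."""
    return mean_shape + np.tensordot(coeffs, basis, axes=1)

head = synthesize(np.array([0.5, -0.2, 0.0, 1.0, 0.3]))
```

A deep network replaces the fixed linear basis with a learned nonlinear mapping from coefficients to shape, which is what makes effects like varying hairstyles representable.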

5. In [Eisenberger et al. CVPR 2022], we derived a unified mathematical framework for computing correspondence in a deep network architecture. This approach will likely be of value to any deep network that aims to compute correspondence - whether between points on two 3D shapes, between pixels in images, or in other correspondence problems.
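As a generic illustration of how deep networks commonly express correspondence in a differentiable way (this is not the paper's specific framework): a row-stochastic soft-assignment matrix is computed from feature similarity via a softmax.

```python
import numpy as np

def soft_correspondence(feat_a, feat_b, temperature=0.1):
    """Row i is a probability distribution over elements of B matching a_i."""
    sim = feat_a @ feat_b.T                     # pairwise feature similarity
    sim = sim / temperature
    sim = sim - sim.max(axis=1, keepdims=True)  # stabilize the softmax
    p = np.exp(sim)
    return p / p.sum(axis=1, keepdims=True)

feat_a = np.eye(3)             # toy features: 3 elements on shape/image A
feat_b = np.eye(3)[[2, 0, 1]]  # same features on B, permuted
P = soft_correspondence(feat_a, feat_b)
```

Because the assignment is soft and differentiable, it can be trained end to end, regardless of whether the elements are 3D points or pixels.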

6. In [Hofherr et al. WACV 2023], we will present a method that allows us to compute a physical simulation directly from video. In contrast to our earlier work [Weiss et al. CVPR 2020], the new approach makes use of a neural network and learning.

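A toy version of this simulation-from-observation idea (not the WACV 2023 method) is to recover an unknown physical parameter - here gravity - by matching a simulated trajectory to positions tracked in a video. The simulator and all values below are illustrative.

```python
import numpy as np
from scipy.optimize import least_squares

def simulate(g, y0=2.0, dt=0.05, steps=20):
    """Free fall under gravity g, integrated with explicit Euler steps."""
    y, v, traj = y0, 0.0, []
    for _ in range(steps):
        v -= g * dt
        y += v * dt
        traj.append(y)
    return np.array(traj)

observed = simulate(9.81)   # stand-in for positions tracked in a video
fit = least_squares(lambda g: simulate(g[0]) - observed, x0=[5.0])
```

Once the parameters are fitted, the same simulator re-synthesizes the observation and extrapolates it into the future - the project's overall goal, applied here to the simplest possible system.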
We plan to venture further into the simulation of deformable 3D shapes and the development of algorithms to compute such simulations from a video observation. This will lead us closer to the overall goal of this project. At the same time, we continue to develop algorithms for 3D shape analysis and for camera-based 3D reconstruction. This ensures that even if the final goal is not met, we still generate highly valuable contributions to the fields of 3D shape analysis and camera-based reconstruction.