
Performance Capture of the Real World in Motion

Final Report Summary - CAPREAL (Performance Capture of the Real World in Motion)

CapReal developed important algorithmic foundations for the next generation of performance capture methods. The long-term goal is to enable dynamic shape, motion and appearance reconstruction, with a focus on human reconstruction, at previously unseen detail, in general scenes (including outdoors), and with only a few cameras. The project researched foundational algorithmic questions at the intersection of computer vision and computer graphics, and also began to explore new ways to methodically integrate machine learning concepts from AI with new dynamic scene capture algorithms. The project has greatly advanced the state of the art in marker-less performance capture. In total, more than 80 peer-reviewed research papers were published at high-quality venues during the five-year project duration.
This includes 20 papers at the top computer graphics conferences (SIGGRAPH, SIGGRAPH Asia, EUROGRAPHICS, published as special issues of ACM Trans. Graphics (TOG) and Computer Graphics Forum (CGF), respectively), 22 papers at the top vision conferences (CVPR, ECCV, ICCV), and 15 papers in top journals in vision and graphics (e.g. ACM TOG, CGF, IEEE PAMI, IEEE TVCG). It also includes an edited book on Digital Representations of the Real World with CRC Press. The research team also maintains collaborations with leading international research institutions, including Stanford University, Microsoft Research, Technicolor Research, the University of Erlangen Nuremberg, TU Munich, ETH Zuerich, EPFL, the University of Hong Kong, and UCL, to name a few.

In the following, we highlight a few milestone results, in particular outcomes of cross-disciplinary relevance and results benefiting from unconventional research approaches.

We extended and combined ideas from computer graphics and computer vision to enable new inverse rendering methods. These methods estimate much more detailed models of shape, illumination and reflectance from sparse imagery recorded in less controlled environments than previously possible. This enabled us, in turn, to do shading-based refinement in general scenes at much higher detail than previously feasible, to estimate much more detailed appearance and illumination models in uncalibrated environments, and to use these extracted models to improve correspondence finding and 4D reconstruction in general scenes. In conjunction with the new high-performance non-linear solvers we developed, even dense real-time reconstruction and inverse rendering from stereo or single camera views are, for the first time, feasible for certain types of scenes.

We further developed new scene representation and 4D reconstruction algorithms that scale to dense scenes (many scene elements, difficult deformations, occlusions, apparent topology changes, etc.). These enabled, for instance, one of the first methods for performance capture of closely interacting subjects, as well as a new implicit formulation for analysis-by-synthesis reconstruction in less controlled scenes, featuring a new visibility formulation that is analytically differentiable everywhere. Our new representations also enable combined general deformable scene capture and template reconstruction from sparse camera input in real time, without needing a static shape model to be provided a priori.

We further investigated new methods to learn and exploit scene priors (data-driven or physics-based) for improved 4D reconstruction, as well as for user-guided intuitive interpretation and modification of captured scenes. As an example, we proposed new methods to estimate semantically meaningful deformation subspaces, as well as new approaches to design and learn lower-dimensional motion subspaces of arbitrary deforming shapes; both enable improved 4D reconstruction in less controlled scenes and improved animation editing. We also showed new ways to learn parametric deformable shape models from only weakly labelled real-world image data, instead of having to resort to complex controlled scanning devices.

In the second phase of the project, we began to develop new ways of combining the aforementioned generative reconstruction concepts with machine learning-based detection and classification methods for improved 4D reconstruction in less controlled environments. This led to several milestone achievements of the project, namely some of the first approaches to real-time skeletal human motion capture from a single color camera, a new generation of methods for high-quality, real-time dense reconstruction of dynamic face geometry and appearance from single camera views, as well as some of the seminal methods for marker-less hand and object motion capture in real time from sparse camera input.

Another important outcome of the project is the GVVPerfcapEva repository of research data sets and research code. We make available a wide range of shape and performance capture data sets created in the CapReal project and in collaborations with partner research groups at MPI for Informatics and other institutes. These data sets provide an opportunity to enable and evaluate new algorithms targeting different sub-fields of performance capture, such as general deformable shape capture, full body performance capture, facial performance capture, or performance capture of hand and finger motions. We also make available example implementations of several of our algorithms for reference, and provide tools to simplify the development of efficient numerical optimizers for the challenging non-convex optimization problems that frequently occur in dynamic scene capture research.