Periodic Reporting for period 1 - CoMBo (Correspondence through Millions Bodies: a large-scale, functional, and implicit data-driven method for 3D Humans matching)
Reporting period: 2023-07-01 to 2024-07-31
Understanding the world through shapes and geometry has challenged humans since the dawn of civilization. The advent of computers and the study of Geometry Processing opened up dramatic advancements and possibilities, from simulated surgical operations to whole digital universes. Among all physical entities, the human body plays a central role. Modern technologies can acquire, digitalize, and imitate human appearance at a level that is almost indistinguishable from reality. Interest in Virtual Humans is growing fast among the public and in several scientific and economic fields, from entertainment to medicine, from social sciences to ergonomics. The Virtual Avatars market has an estimated value of around USD 10 billion and is expected to exceed USD 520 billion by 2030.

However, our understanding can only be complete with tools to establish analogies: what is similar, what is different, or, in other words, what is in correspondence. Finding such connections between human geometries is the key enabler for several downstream applications, such as virtual try-on, texture transfer, or anthropometric statistics. Computer Vision and Graphics have studied the problem intensely, given its fundamental and applied relevance. Still, no method has established itself as a flexible standard for obtaining robust and precise 3D correspondences across different humans. Noise, garments, objects, or partiality often pose challenges that require ad-hoc strategies.

The CoMBo (Correspondence through Millions Bodies) project aims to fill this gap by combining multiple techniques: relying on a vast human dataset, exploiting a flexible geometrical representation, and developing a novel data-driven framework capable of handling this comprehensive set of challenges. CoMBo will produce substantial scientific, economic, and social impacts in this strategic field, carrying out intense outreach to a broad audience and informing the public about the body digitalization process and the importance of a fair representation of the human experience in this fast-changing technology.
1) Data collection: To achieve our goal of a robust method, we start from a publicly available dataset of motion capture sequences (AMASS). This dataset contains millions of frames of human motion, capturing thousands of actions and sequences. Moreover, these 3D shapes are encoded as SMPL parameters, providing a natural correspondence across all the shapes. Hence, we first subsample the dataset to around one hundred thousand frames, since nearby temporal instants carry similar semantic information. Then, we convert the SMPL parameters into 3D meshes and further into an implicit representation, in particular unsigned distance fields. Finally, we store each sample together with its original point cloud, which provides a discrete 3D ground truth for the implicit unsigned distance field.
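A minimal sketch of this preparation step, assuming the smplx and SciPy libraries; the stride, file names, model path, and the KD-tree approximation of the unsigned distance are illustrative choices, not the project's exact pipeline:

```python
import numpy as np
import torch
import smplx
from scipy.spatial import cKDTree

# Temporal subsampling: nearby frames carry similar semantics, so keep 1 in K.
K = 30  # illustrative stride, tuned to reach roughly 1e5 frames overall
data = np.load("amass_sequence.npz")                # one AMASS motion file
poses, betas = data["poses"][::K], data["betas"]

# SMPL forward pass for one retained frame. AMASS stores SMPL-H parameters;
# dropping the 30 hand joints and zeroing SMPL's two hand joints is a common
# simplification.
model = smplx.create("models/", model_type="smpl")  # placeholder model path
body_pose = np.concatenate([poses[0, 3:66], np.zeros(6)])
out = model(
    betas=torch.tensor(betas[None, :10], dtype=torch.float32),
    global_orient=torch.tensor(poses[None, 0, :3], dtype=torch.float32),
    body_pose=torch.tensor(body_pose[None], dtype=torch.float32),
)
verts = out.vertices[0].detach().numpy()            # (6890, 3) SMPL vertices

# Approximate unsigned distance field: distance from any 3D query point to
# the surface, estimated here with a KD-tree over the mesh vertices.
tree = cKDTree(verts)
queries = np.random.uniform(-1.2, 1.2, size=(4096, 3))
udf, _ = tree.query(queries)                        # unsigned distances

# Store the sample with its point cloud: the discrete 3D ground truth
# accompanying the implicit unsigned distance field.
np.savez("frame_0000.npz", points=verts, queries=queries, udf=udf)
```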
2) Exploiting Geometric representations: at the start of our analysis, we observed that the advancement of Neural Representations provides a new set of tools and greater flexibility. These representations can be queried anywhere in 3D space, providing precise discrete information while retaining their continuous nature. In particular, we found inspiration in the recent Learned Vertex Descent (LVD) paradigm. LVD trains a neural network backbone to predict, for every point in space, the offset toward the ground-truth positions of the registration points. In our exploration, we first improved this method by proposing LoVD (Localized Vertex Descent), a new variant that includes a geometrical inductive bias to attend to local regions of the body. Despite this, we noted that the network could not generalize to geometries significantly out of distribution (e.g., noise, clothing), which may substantially impact the input features, resulting in misalignment between the prediction and the target surface. We consider misalignments between the template registration and the target shape to be avoidable errors: the template should generally lie precisely on the target, which constitutes the solution space for the template vertices. Hence, we realized that fine-tuning the backbone neural field to respect this property at inference time would be more efficient than augmenting our dataset. Inspired by popular iterative registration approaches, we proposed a novel self-supervised task called Neural Iterative Closest Point (NICP), specifically designed to refine Neural Fields at inference time. NICP iteratively queries the network directly on the target surface and penalizes queries for which no predicted offset is close to zero, i.e., it reduces deviations toward misaligned solutions. NICP takes a few seconds and lets the network adapt without requiring ad-hoc techniques or expensive data augmentations. A sketch of both ideas follows.
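A compact PyTorch sketch of the two ideas above, under the assumption that the backbone `field` is a module mapping query points (N, 3) to offsets toward all V template vertices, shape (N, V, 3); the interface and hyperparameters are illustrative readings of the description, not the released implementation:

```python
import torch

def vertex_descent(field, verts, n_steps=20):
    """LVD-style inference: each template vertex follows the offset the
    network predicts for it at the vertex's current estimate."""
    idx = torch.arange(verts.shape[0])
    for _ in range(n_steps):
        verts = verts + field(verts)[idx, idx]    # vertex i uses its own offset
    return verts

def nicp_refine(field, target_points, n_iters=100, lr=1e-4, batch=1024):
    """NICP-style self-supervised refinement at inference time: for points
    sampled on the target surface, at least one template vertex should land
    there, so the smallest predicted offset norm is pushed toward zero."""
    opt = torch.optim.Adam(field.parameters(), lr=lr)
    for _ in range(n_iters):
        pts = target_points[torch.randint(len(target_points), (batch,))]
        offsets = field(pts)                       # (batch, V, 3)
        loss = offsets.norm(dim=-1).min(dim=-1).values.mean()
        opt.zero_grad(); loss.backward(); opt.step()
```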
3) 3D Registration Pipeline: Finally, we incorporate LoVD, trained on a large dataset, and NICP into a complete registration pipeline, which also includes a Chamfer Distance optimization of the final prediction, together with a local displacement optimization to capture the finest details of the target. We call our method Neural Scalable Registration (NSR), and we validate it on thousands of shapes coming from more than ten different sources, public benchmarks, and several challenges.
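The Chamfer refinement could look like the following sketch, where the loss weights, iteration counts, and regularizer are illustrative assumptions:

```python
import torch

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    d = torch.cdist(a, b)                          # (N, M) pairwise distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def refine_prediction(pred_verts, target_points, n_iters=200, lr=1e-3):
    """Optimize a small per-vertex displacement so the registration snaps to
    the target while staying close to the network prediction."""
    disp = torch.zeros_like(pred_verts, requires_grad=True)
    opt = torch.optim.Adam([disp], lr=lr)
    for _ in range(n_iters):
        loss = chamfer(pred_verts + disp, target_points) \
             + 1e-2 * disp.norm(dim=-1).mean()     # keep displacements small
        opt.zero_grad(); loss.backward(); opt.step()
    return (pred_verts + disp).detach()
```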
Finally, we released our code publicly on GitHub, freely usable for research purposes.
We also obtained promising results on partial scans from a single depth view and on noisy point clouds from the fusion of multiple Kinects. These are real-world data often used in industrial pipelines. Our method can be applied directly to these data sources without costly ad-hoc calibrations, and it unifies acquisitions coming from disparate sources (e.g., multi-view reconstruction, body scans, artist-made 3D models). We also demonstrate the impact on typical downstream applications, such as texture transfer and automatic animation of avatars.
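As an illustration of how a shared registration enables texture transfer, the following hypothetical sketch carries per-point colors from a source scan to a target scan through the common template; the function and variable names are ours, not the released code:

```python
import numpy as np
from scipy.spatial import cKDTree

def transfer_colors(reg_src, src_points, src_colors, reg_tgt, tgt_points):
    """reg_src / reg_tgt: the same template registered to the source and the
    target scans (V, 3), sharing vertex order. Colors flow from the source
    scan to the template, then from the template to the target scan."""
    # 1) Pull colors from the source scan onto the template vertices.
    _, nn = cKDTree(src_points).query(reg_src)
    template_colors = src_colors[nn]               # (V, 3) RGB per vertex
    # 2) Push template colors onto the target scan points.
    _, nn = cKDTree(reg_tgt).query(tgt_points)
    return template_colors[nn]                     # one color per target point
```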