## Periodic Reporting for period 2 - G-Statistics (Foundations of Geometric Statistics and Their Application in the Life Sciences)

Reporting period: 2020-03-01 to 2021-08-31

Geometry has proved to be a foundational mathematical structure for many theories in physics, such as geometric and statistical mechanics, the space-time structure of relativity, and particle physics, where invariance under gauge transformation groups provides the natural mathematical framework. In other domains, such as the life sciences, the elementary laws are less obvious. Here, we believe that geometry could be decisive for distinguishing underlying mechanisms from measurement noise. For that purpose, we need to develop new mathematical tools for estimating approximate invariance and learning general laws from data. Following Poincaré's famous statement that “a geometry cannot be more true than another, it may just be more convenient”, the goal is to identify the most convenient geometry for the data under analysis. However, despite the ubiquity of non-linearity in today's data science, statistics are often performed as if we were in a Euclidean space, thus neglecting the potentially drastic effects of non-linearities and singularities on statistical estimation.

Geometric statistics strives to develop a rigorous statistical theory on manifolds and, more generally, on spaces with a geometric structure. By looking at statistics from a geometric point of view, the G-Statistics project aims at strengthening their mathematical foundations and at exemplifying their impact on selected applications in the life sciences. So far, mainly Riemannian manifolds and negatively curved metric spaces have been studied in depth. Other geometric structures such as Lie groups, affine connection spaces, and quotient and stratified spaces naturally arise in applications. G-Statistics aims at exploring ways to unify statistical estimation theories, explaining how statistical estimation diverges from the Euclidean case in the presence of curvature, singularities, and stratification. The goal is to tackle summary statistics more complex than the Fréchet mean and to develop new subspace learning and dimension reduction methods. Beyond the mathematical theory, the project aims at implementing generic estimation algorithms and at illustrating the impact of some of their efficient specializations on selected manifolds. The applications considered in the life sciences include in particular the study of anatomical shapes and the forecast of their evolution from databases of medical images (computational anatomy).

Xavier Pennec, together with Stefan Sommer and Tom Fletcher, edited a book presenting the status of the methodological foundations and applications of geometric statistics in medical imaging. It was published in 2020 in the Elsevier and MICCAI Society book series. Beyond this state of the art, we made advances in understanding the implications of the geometric structure of the space on the statistical estimation theory. For smooth Riemannian and affine connection spaces, we have developed new coordinate-free and tensorial Taylor expansions of geodesics providing polynomial approximations, at any order, of problems related to geodesic triangles.

A first foundational result obtained from these is the numerical accuracy analysis of the discrete ladder algorithms for parallel transport on manifolds, with exact or approximated geodesics. From an artificial intelligence point of view, parallel transport can be seen as the natural method for mapping statistics around one point to another point, a problem called domain adaptation. Pole ladder is particularly appealing since it relies only on geodesic symmetry and midpoint computations. Moreover, it exhibits a higher intrinsic accuracy than other classical methods. Remarkably, it is even exact in a single step in symmetric spaces. It is thus a very simple algorithm for parallel transport on Riemannian and affine connection spaces that leverages the power of many geometric implementations of continuous or discrete geodesics.
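
The construction can be sketched in a few lines. Below is a minimal NumPy illustration on the unit sphere (a symmetric space, so a single rung should be exact), using the closed-form exponential and logarithm maps; the helper names are ours and this is not the project's implementation. One pole ladder rung is compared against the closed-form parallel transport along the geodesic.

```python
import numpy as np

def sphere_exp(x, v):
    """Riemannian exponential map at x on the unit sphere."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x.copy()
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def sphere_log(x, y):
    """Riemannian logarithm map at x on the unit sphere."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    proj = y - cos_theta * x              # tangent component of y at x
    norm_proj = np.linalg.norm(proj)
    if norm_proj < 1e-12:
        return np.zeros_like(x)
    return np.arccos(cos_theta) * proj / norm_proj

def pole_ladder_step(x, y, v):
    """One pole ladder rung: transport tangent vector v from x to y
    using only the geodesic midpoint and the geodesic symmetry."""
    m = sphere_exp(x, 0.5 * sphere_log(x, y))   # midpoint of geodesic x -> y
    q = sphere_exp(x, v)                         # rung endpoint at x
    q_sym = sphere_exp(m, -sphere_log(m, q))     # geodesic symmetry through m
    return -sphere_log(y, q_sym)                 # minus sign undoes the flip

def closed_form_transport(x, y, v):
    """Exact parallel transport along the geodesic from x to y."""
    w = sphere_log(x, y)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return v.copy()
    e = w / theta
    a = np.dot(v, e)
    # The along-geodesic component rotates; the orthogonal one is fixed.
    return v - a * e + a * (np.cos(theta) * e - np.sin(theta) * x)

x = np.array([1.0, 0.0, 0.0])
y = np.array([np.cos(0.7), np.sin(0.7), 0.0])
v = np.array([0.0, 0.1, 0.25])   # tangent at x (orthogonal to x)
```

On this symmetric space the two results agree to machine precision, illustrating the one-step exactness mentioned above.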

The impact of the manifold curvature on the estimation of the empirical Fréchet mean was a second essential result. We found an unexpected bias of the empirical mean scaling as 1/n, which matters in the small-sample regime, and a modulation of the convergence rate of the covariance matrix proportional to the covariance-curvature tensor. These results unveil an intermediate behavior of the empirical mean in manifolds, linking two of the major new phenomena recently discovered in geometric statistics: stickiness and smeariness. The lesson is that one may need drastically more samples in a positively curved manifold than in a Euclidean space to estimate a quantity up to a given uncertainty. Conversely, fewer samples are needed in negatively curved spaces, even though unbounded negative-curvature singularities may lead to uninformative sticky estimations.

From the technological point of view, we have contributed to the Python package geomstats (https://geomstats.github.io/), a generic library of statistical computing algorithms on different geometric structures. The package currently supports more than 15 manifolds with closed-form geodesics (when known) or discrete geodesics obtained by optimization otherwise. This package encapsulates complex notions of Riemannian geometry in a consistent object-oriented API that makes it readable and editable by mathematicians. From an applied perspective, using the algorithms does not require a deep understanding of the mathematics and is made easy thanks to a standard scikit-learn interface.
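
As an illustration of the kind of generic estimation algorithm such a library provides, here is a minimal plain-NumPy sketch (our own illustration, not the geomstats API itself) of the standard fixed-point iteration for the Fréchet mean on the unit sphere: average the logarithm maps of the data at the current estimate, then shoot back along that average with the exponential map.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map at x on the unit sphere."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return x.copy()
    return np.cos(norm_v) * x + np.sin(norm_v) * v / norm_v

def sphere_log(x, y):
    """Logarithm map at x on the unit sphere."""
    cos_theta = np.clip(np.dot(x, y), -1.0, 1.0)
    proj = y - cos_theta * x
    norm_proj = np.linalg.norm(proj)
    if norm_proj < 1e-12:
        return np.zeros_like(x)
    return np.arccos(cos_theta) * proj / norm_proj

def frechet_mean(points, n_iter=100, tol=1e-10):
    """Fréchet mean by intrinsic fixed-point iteration: the gradient of the
    Fréchet variance at mu is minus the average of the logs of the data."""
    mu = points[0].copy()
    for _ in range(n_iter):
        grad = np.mean([sphere_log(mu, p) for p in points], axis=0)
        if np.linalg.norm(grad) < tol:
            break
        mu = sphere_exp(mu, grad)
    return mu

# Three points placed symmetrically around the north pole, so by symmetry
# their Fréchet mean is the pole itself.
alpha = 0.5   # common polar angle
points = np.array([
    [np.sin(alpha) * np.cos(phi), np.sin(alpha) * np.sin(phi), np.cos(alpha)]
    for phi in (0.0, 2 * np.pi / 3, 4 * np.pi / 3)
])
mean = frechet_mean(points)
```

Note that this intrinsic mean differs from the normalized Euclidean average as soon as the data are spread out, which is precisely where the curvature effects discussed above come into play.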

Reformulating summary statistics such as the Fréchet mean or k-means as geometric projections of population and empirical distributions onto carefully selected subspaces would lead to major advances in our geometric statistics agenda. This should allow a much simpler characterization of their uniqueness and the development of unified theorems on asymptotic convergence properties.

We also expect to obtain a geometric formulation of principal component analysis (PCA) where the resulting flag of subspaces is the projection of the population/empirical covariance matrix onto a specific submanifold. This construction should lead to a very simple central limit theorem that would give rise to computable confidence intervals for PCA with finite samples. Such results would also be very useful for controlling the statistical validity of the ubiquitous spectral-based algorithms on images or graphs that rely on the spectral decomposition of the Laplacian.

The intrinsically geometric formulation of PCA as an optimization on flags of subspaces is a third interesting direction of research. The nestedness property guarantees that data approximation spaces are compatible at lower and higher orders. This could be the key mathematical feature for multiscale data analysis, since flags of subspaces are the natural mathematical objects to encode hierarchically embedded approximation spaces. Although ‘no geometry is more true than another’, we hope to be able to construct series of nested geometries that are more and more convenient for describing the data.
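
The nestedness property can be made concrete in the classical Euclidean case: the optimal k-dimensional PCA subspace contains the optimal (k-1)-dimensional one, so the family of subspaces forms a flag. A minimal NumPy sketch (our own illustration, not project code):

```python
import numpy as np

rng = np.random.default_rng(0)
# Centered sample with three well-separated variances along the axes.
X = rng.standard_normal((500, 3)) * np.array([3.0, 1.5, 0.5])
X -= X.mean(axis=0)
cov = X.T @ X / len(X)

# Optimal k-dimensional approximation subspaces are spanned by the top
# eigenvectors of the sample covariance (np.linalg.eigh: ascending order).
eigvals, eigvecs = np.linalg.eigh(cov)
V1 = eigvecs[:, -1:]    # best 1-d subspace
V2 = eigvecs[:, -2:]    # best 2-d subspace

# Nestedness (flag property): the best 1-d subspace lies inside the best
# 2-d one, so projecting V1 onto span(V2) leaves it unchanged.
proj_V1 = V2 @ (V2.T @ V1)
residual = np.linalg.norm(V1 - proj_V1)
```

The intrinsically geometric formulation above aims to carry exactly this hierarchical structure over to curved spaces, where nested approximation subspaces are no longer given for free by an eigendecomposition.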
