# Foundations of Geometric Statistics and Their Application in the Life Sciences

## Periodic Reporting for period 2 - G-Statistics (Foundations of Geometric Statistics and Their Application in the Life Sciences)

Reporting period: 2020-03-01 to 2021-08-31

Geometry has proved to be a foundational mathematical aspect of many theories in physics, such as geometric and statistical mechanics, the space-time structure in relativity, and particle physics, where invariance under gauge transformation groups provides the natural mathematical structure. In other domains such as the life sciences, the elementary laws are less obvious. Here, we believe that geometry could be decisive for distinguishing underlying mechanisms from measurement noise. For that purpose, we need to develop new mathematical tools for estimating approximate invariance and learning general laws from data. Following the famous statement of Poincaré, “a geometry cannot be more true than another, it may just be more convenient”, the goal would be to identify the most convenient geometry for the analyzed data. However, despite the ubiquity of non-linearity in data science today, statistics are often performed as if we were in a Euclidean space, thus neglecting the potentially drastic effects of non-linearities and singularities on statistical estimation.

Geometric statistics strives to develop a rigorous statistical theory on manifolds and, more generally, on spaces with a geometric structure. By looking at statistics from a geometric point of view, the G-Statistics project aims at strengthening their mathematical foundations and at exemplifying the impact on selected applications in the life sciences. So far, mainly Riemannian manifolds and negatively curved metric spaces have been studied in depth. Other geometric structures such as Lie groups, affine connection spaces, and quotient and stratified spaces arise naturally in applications. G-Statistics aims at exploring ways to unify statistical estimation theories, explaining how statistical estimation diverges from the Euclidean case in the presence of curvature, singularities, and stratification. The goal is to tackle summary statistics more complex than the Fréchet mean and to develop new subspace learning and dimension reduction methods. Beyond the mathematical theory, the project aims at implementing generic estimation algorithms and at illustrating the impact of some of their efficient specializations on selected manifolds. The applications considered in the life sciences include, in particular, the study of anatomical shapes and the forecast of their evolution from databases of medical images (computational anatomy).

Xavier Pennec edited, with Stefan Sommer and Tom Fletcher, a book presenting the state of the methodological foundations and applications of geometric statistics in medical image analysis. It was published in 2020 in the Elsevier and MICCAI Society book series. Beyond this state of the art, we made advances in understanding the implications of the geometric structure of the space for statistical estimation theory. For smooth Riemannian and affine connection spaces, we have developed new coordinate-free, tensorial Taylor expansions of geodesics, which provide polynomial approximations at any order for problems involving geodesic triangles.

A first foundational result obtained from these expansions is the numerical accuracy analysis of the discrete ladder algorithms for parallel transport on manifolds, with exact or approximated geodesics. From an artificial intelligence point of view, parallel transport can be seen as the natural method for mapping statistics around one point to another point, a problem called domain adaptation. Pole ladder is particularly appealing since it relies only on geodesic symmetry and mid-point computations. Moreover, it exhibits a higher intrinsic accuracy than other classical methods. Remarkably, it is even exact in a single step in symmetric spaces. It is thus a very simple algorithm for parallel transport on Riemannian and affine connection spaces that leverages the power of many geometric implementations of continuous or discrete geodesics.
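
The one-step exactness in symmetric spaces can be checked numerically on the unit sphere, which is a symmetric space. The following is a minimal sketch, not the project's implementation: the function names (`sphere_exp`, `sphere_log`, `pole_ladder_step`, `closed_form_transport`) and the test points are illustrative assumptions, and the reference value is the standard closed-form parallel transport along a great circle.

```python
import numpy as np


def sphere_exp(p, v):
    """Exponential map of the unit sphere at p applied to the tangent vector v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p.copy()
    return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v


def sphere_log(p, q):
    """Logarithm map of the unit sphere at p (antipodal points are not handled)."""
    cos_theta = np.dot(p, q)
    u = q - cos_theta * p  # component of q tangent to the sphere at p
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:
        return np.zeros_like(p)
    return np.arctan2(norm_u, cos_theta) * u / norm_u


def pole_ladder_step(p, q, v):
    """One pole ladder step: transport v from p to q using a mid-point and a symmetry."""
    m = sphere_exp(p, 0.5 * sphere_log(p, q))           # mid-point of the geodesic p -> q
    p_shot = sphere_exp(p, v)                           # end point of the geodesic shot along v
    reflected = sphere_exp(m, -sphere_log(m, p_shot))   # geodesic symmetry through m
    return -sphere_log(q, reflected)


def closed_form_transport(p, q, v):
    """Closed-form parallel transport of v along the great circle from p to q."""
    w = sphere_log(p, q)
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return v.copy()
    e = w / theta
    a = np.dot(e, v)
    return v - a * e + a * (np.cos(theta) * e - np.sin(theta) * p)


# Illustrative points: north pole to a point on the equator, one tangent vector at p.
p = np.array([0.0, 0.0, 1.0])
q = np.array([1.0, 0.0, 0.0])
v = np.array([0.3, 0.4, 0.0])

ladder = pole_ladder_step(p, q, v)
exact = closed_form_transport(p, q, v)
print(np.allclose(ladder, exact, atol=1e-8))
```

On a manifold that is not symmetric, a single step would only be approximate, and the transport would typically be iterated over several short geodesic segments.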

A second essential result concerns the impact of the manifold curvature on the estimation of the empirical Fréchet mean. We found an unexpected bias of the empirical mean of order 1/n, which is important in the small sample regime, and a modulation of the convergence rate of the covariance matrix proportional to the covariance-curvature tensor. These results unveil an intermediate behavior of the empirical mean in manifolds, linking two of the major new phenomena discovered recently in geometric statistics: stickiness and smeariness. The lesson is that one may need drastically more samples in a positively curved manifold than in a Euclidean space to estimate a quantity to a given level of uncertainty. Conversely, fewer samples are needed in negatively curved spaces, even though unbounded negative curvature at singularities may lead to uninformative sticky estimations.
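
For reference, the Fréchet mean discussed here is usually defined as a minimizer of the expected squared distance. The notation below is ours and does not appear in the report: $\mathcal{M}$ is the manifold, $d$ its geodesic distance, $X$ a random point of $\mathcal{M}$, and $x_1, \dots, x_n$ an i.i.d. sample.

$$
\bar{x} \in \arg\min_{x \in \mathcal{M}} \mathbb{E}\left[ d(x, X)^2 \right],
\qquad
\bar{x}_n \in \arg\min_{x \in \mathcal{M}} \frac{1}{n} \sum_{i=1}^{n} d(x, x_i)^2 .
$$

In a Euclidean space, $\bar{x}_n$ is the arithmetic mean: it is unbiased and its covariance is exactly $\Sigma / n$, where $\Sigma$ is the covariance of $X$. The curvature effects described above are deviations from this baseline.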

From the technological point of view, we have contributed to the Python package geomstats (https://geomstats.github.io/), a generic library for statistical computing algorithms on different geometric structures. The package currently supports more than 15 manifolds, with closed-form geodesics when they are known and discrete geodesics obtained by optimization otherwise. It wraps complex notions of Riemannian geometry in a consistent object-oriented API that makes the code readable and editable by mathematicians. From an applied perspective, using the algorithms does not require a deep understanding of the mathematics and is made easy by a standard scikit-learn interface for artificial intelligence.
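
To illustrate what a scikit-learn-style interface for a geometric estimator can look like, here is a minimal NumPy sketch of a Fréchet mean estimator on the unit sphere. It is not geomstats code and does not reproduce its API: the class name `SphereFrechetMean`, its parameters, and the helper functions are hypothetical, and only the `fit` method and trailing-underscore attribute follow the scikit-learn convention.

```python
import numpy as np


def sphere_exp(p, v):
    """Exponential map of the unit sphere at p applied to the tangent vector v."""
    norm_v = np.linalg.norm(v)
    if norm_v < 1e-12:
        return p.copy()
    return np.cos(norm_v) * p + np.sin(norm_v) * v / norm_v


def sphere_log(p, q):
    """Logarithm map of the unit sphere at p (antipodal points are not handled)."""
    cos_theta = np.dot(p, q)
    u = q - cos_theta * p
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:
        return np.zeros_like(p)
    return np.arctan2(norm_u, cos_theta) * u / norm_u


class SphereFrechetMean:
    """Hypothetical estimator with a scikit-learn-like fit interface."""

    def __init__(self, max_iter=100, tol=1e-10):
        self.max_iter = max_iter
        self.tol = tol

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        mean = X[0].copy()  # start from the first sample
        for _ in range(self.max_iter):
            # Average the samples in the tangent space at the current estimate,
            # then move along the geodesic in that direction.
            tangent_mean = np.mean([sphere_log(mean, x) for x in X], axis=0)
            mean = sphere_exp(mean, tangent_mean)
            mean /= np.linalg.norm(mean)  # guard against numerical drift
            if np.linalg.norm(tangent_mean) < self.tol:
                break
        self.estimate_ = mean
        return self


# Four points placed symmetrically around the north pole (illustrative data).
angle = 0.3
azimuths = np.array([0.0, 0.5, 1.0, 1.5]) * np.pi
X = np.stack(
    [
        np.sin(angle) * np.cos(azimuths),
        np.sin(angle) * np.sin(azimuths),
        np.full(4, np.cos(angle)),
    ],
    axis=1,
)

estimator = SphereFrechetMean().fit(X)
print(estimator.estimate_)
```

By symmetry, the Fréchet mean of these four points is the north pole, so the printed estimate should be close to it. A caller only needs the `fit` and `estimate_` names, not the exponential and logarithm maps underneath.
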
Reformulating summary statistics such as the Fréchet mean or k-Means as geometric projections of population and empirical distributions on carefully selected subspaces would lead to major advances in our geometric statistics agenda. This should allow a much simpler characterization of their uniqueness and the development of unified theorems for asymptotic convergence properties.

We also expect to obtain a geometric formulation of principal component analysis (PCA) in which the resulting flag of subspaces is the projection of the population or empirical covariance matrix onto a specific submanifold. This construction should lead to a very simple central limit theorem that would give rise to computable confidence intervals for PCA with finite samples. Such results would also be very useful for controlling the statistical validity of the ubiquitous spectral algorithms on images or graphs that rely on the spectral decomposition of the Laplacian.

The intrinsically geometric formulation of PCA as an optimization on flags of subspaces is a third interesting direction of research. The nestedness property guarantees that data approximation spaces are compatible at low and higher orders. This could be the key mathematical feature for multiscale data analysis, since flags of subspaces are natural mathematical objects for encoding hierarchically embedded approximation spaces. Although ‘no geometry is more true than another’, we hope to be able to construct series of nested geometries that are more and more convenient for describing the data.
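
The nestedness property is easy to observe in classical Euclidean PCA, where the subspaces spanned by the first k principal directions form a flag. The sketch below covers only this Euclidean case on synthetic data, not the manifold-valued formulation targeted by the project; the data, the `projector` helper, and the variable names are illustrative assumptions.

```python
import numpy as np

# Synthetic anisotropic data in dimension 4 (illustrative only).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([3.0, 2.0, 1.0, 0.5])
X_centered = X - X.mean(axis=0)

# Eigen-decomposition of the empirical covariance, sorted by decreasing variance.
covariance = X_centered.T @ X_centered / (len(X) - 1)
eigenvalues, eigenvectors = np.linalg.eigh(covariance)
order = np.argsort(eigenvalues)[::-1]
eigenvectors = eigenvectors[:, order]


def projector(k):
    """Orthogonal projector onto the span of the first k principal directions."""
    basis = eigenvectors[:, :k]
    return basis @ basis.T


# Nestedness: the k-dimensional subspace is contained in the (k+1)-dimensional one,
# which is equivalent to P_{k+1} P_k = P_k for the orthogonal projectors.
nested = all(
    np.allclose(projector(k + 1) @ projector(k), projector(k)) for k in range(1, 4)
)

# The residual of the rank-k approximation cannot increase along the flag.
residuals = [
    float(np.sum((X_centered - X_centered @ projector(k)) ** 2)) for k in range(1, 5)
]

print(nested)
print(residuals)
```

Because the subspaces are nested, a low-order approximation of the data remains valid as a coarse version of any higher-order one. This compatibility across orders is what the flag formulation aims to keep in more general geometries.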