Periodic Reporting for period 1 - QuRe-ViMaL (Quantitative Rectifiability: from Vitushkin's conjecture to Manifold Learning)
Reporting period: 2024-01-01 to 2026-01-31
Summary of the context and overall objectives of the project
The QuRe-ViMaL project links two a priori very different areas of quantitative research: quantitative geometric measure theory (Q-GMT) on the one hand, and statistical learning theory (SLT) on the other.
The former is a branch of pure mathematics. It grew out of purely theoretical questions in complex analysis, and it deals with understanding the geometry of sets and measures in Euclidean and more general spaces. Imagine, for example, a two-dimensional surface in three-dimensional Euclidean space. Its degree of smoothness is one of its key properties: for example, it determines whether the surface is an amenable domain on which to solve PDEs or compute derivatives and integrals. Creating holes and sharp corners in a smooth surface turns it rough: these are features which, if too abundant, prevent doing analysis and PDEs. The main focus of Q-GMT is then to study the geometry of possibly very high-dimensional objects (sets and measures) by quantifying the presence of holes and corners. In the last twenty years Q-GMT has proved a very powerful tool for extending analysis and PDEs to a much larger class of `surfaces' (of arbitrary dimension), known as Quantitatively Rectifiable sets (or measures). In other words: smoothness is not needed. What is necessary, instead, is a precise quantification and control of holes and corners.
The latter research area, SLT, is tasked with formalising when a model constructed out of observations, that is, data sets, has predictive capacity with respect to the phenomenon observed. For example, it gives criteria to check whether an interpolating function is overfitting or underfitting. SLT and machine learning are afflicted by the so-called curse of dimensionality: the fact that computational costs scale exponentially with the dimension of the dataset. What saves the day is that, more often than not, data points concentrate near geometric objects, such as smooth submanifolds, whose intrinsic dimension is vastly smaller than the ambient dimension. Such an object is known as the `latent (smooth) manifold', and the hypothesis of its existence is known as the `manifold hypothesis'. There is a vast literature on how to detect and describe the latent manifold in a dataset. However, there is plenty of empirical evidence that the manifold hypothesis is too restrictive: data sets do tend to group around lower-dimensional objects, but these may not be as `nice' as a smooth manifold.
The main achievement of QuRe-ViMaL is the development of techniques that use Q-GMT tools to detect the presence of latent Quantitatively Rectifiable sets in datasets, with consequences for both unsupervised and supervised learning. These tools, however, first had to be developed and deepened. This is the other main achievement of QuRe-ViMaL.
Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far
Theme A. Statistical learning theory.
Most often in natural science the following happens: a) We observe a phenomenon. b) Through these observations a model of the phenomenon is constructed. c) We (try to) make predictions concerning the phenomenon through our model. Statistical learning theory gives a sort of meta-model of this process: it formalises in mathematical terms the notions of "phenomenon", "observations" and "model", and thus of prediction.
We briefly describe this learning model: a phenomenon is construed as an unknown probability distribution P on the Euclidean space of dimension n, or on a Hilbert space. The observations are understood as a set of N realisations of N independent random variables, identically distributed according to P.
This learning model is in fact considered in two very distinct flavours: supervised and unsupervised learning. The former describes the situation in which we observe a phenomenon with an input variable and an output variable. Thus, the observations are understood to be pairs z_i=(x_i,y_i). Constructing a "model" of such a phenomenon, then, means finding a function f from the space X of inputs to the space Y of outputs which, given a previously unseen input x^*, predicts a new output y^*.
The latter, on the other hand, describes the situation where only the inputs (x_i) are given. The goal, then, is to find some geometric structure within the data. For example, it may be found that the observations are all close to some lower-dimensional manifold. The point, then, is to predict where the next observation will lie.
Unsupervised learning is key to supervised learning, as we now describe. To effectively learn a function (belonging to e.g. the Lipschitz or Sobolev class) in an ambient space of dimension n up to an error e, the number N of observations required grows exponentially with n: it is of the order e^(-n), a massive number when n is very large, as is common in modern data sets. This phenomenon is known as the curse of dimensionality. What often saves the day is that the inputs are actually close to some d-dimensional geometric object (e.g. an affine d-plane, a d-manifold), where d is much smaller than n.
Indeed, in the last 30 years, a vast array of methodologies, collectively known as "manifold learning", has been developed to perform linear or non-linear dimensionality reduction. They are effectively used to study high-dimensional data under the assumption that the data lie along a d-dimensional manifold (d < n).
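To make the arithmetic behind the curse of dimensionality concrete, the following minimal sketch evaluates the heuristic sample count e^(-n) from the text, once for the ambient dimension and once for a much smaller intrinsic dimension. The function name `samples_needed` and the numerical values (error 0.1, dimensions 100 and 3) are illustrative assumptions, not figures from the project.

```python
import math


def samples_needed(eps: float, dim: int) -> float:
    """Heuristic order of magnitude eps**(-dim) for the number of observations.

    eps is the target error (written e in the text), dim the relevant dimension.
    This is only the scaling law quoted in the report, not a sharp bound.
    """
    if not 0.0 < eps < 1.0:
        raise ValueError("eps must lie strictly between 0 and 1")
    if dim < 1:
        raise ValueError("dim must be a positive integer")
    return eps ** (-dim)


# Hypothetical values: error 0.1, ambient dimension 100, intrinsic dimension 3.
ambient_count = samples_needed(0.1, 100)
intrinsic_count = samples_needed(0.1, 3)
print(f"ambient dimension 100: about 10^{math.log10(ambient_count):.0f} samples")
print(f"intrinsic dimension 3: about 10^{math.log10(intrinsic_count):.0f} samples")
```

Under these assumed values, the gap between the two counts is what makes detecting a low-dimensional structure worthwhile before attempting to learn a function.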
There is some empirical evidence that the manifold hypothesis holds true. However, how one could actually verify this hypothesis was unclear until the recent landmark work of Fefferman et al., "Testing the manifold hypothesis" (JAMS). Given a dimension d, a volume bound V and a reach value R, the authors develop an algorithm which answers yes or no to the question: is there a d-dimensional smooth manifold with volume bound V and reach R which is close to the unknown probability measure P up to a small error (denoted by E)?
This is a remarkable result. However, the manifold hypothesis itself is quite restrictive. Let us give some examples: there are plenty of commonly used image data sets which actually lie on a union of manifolds of varying dimension (see Brown et al., "The union of manifold hypothesis"). David Donoho (Stanford) suggested back in 2000 that there was an urgent need for methods capable of finding geometric structure beyond manifolds. He mentions, for example, "the problem of detecting filaments in noisy data" and "the problem of recovering filamentary structure in 3-D biological data" (see D. Donoho, "High-dimensional data analysis: the curse and blessing of dimensionality", 2000). To this end, he points to the work of G. David and S. Semmes and that of P. Jones. This is where our own expertise comes in. Indeed, quantitative rectifiability (QR from now on) stems from the very works that Donoho cites. However, its tools are now far more powerful than they were in 2000. The first problem we successfully tackled, then, was to follow Donoho's suggestion and apply tools from QR to unsupervised and supervised learning. More precisely:
1. Unsupervised learning: we develop a test to check whether an unknown probability measure in Euclidean or Hilbert space is close to a quantitatively rectifiable set.
2. Supervised learning: we develop a test to check whether the unknown function (or model) is close to a Sobolev function defined on a (lower dimensional) QR set.
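The report does not spell out which quantities the tests in points 1 and 2 compute. As a rough illustration of the kind of multiscale flatness measurement used in the David-Semmes and Jones line of work, the sketch below computes an L2-type "beta number" of a point cloud in one ball: the root-mean-square distance of the samples to their best-fitting d-plane, divided by the radius. The function name `beta2`, the use of PCA to find the plane, and the toy data are assumptions made for illustration only.

```python
import numpy as np


def beta2(points: np.ndarray, center: np.ndarray, radius: float, d: int):
    """Empirical L2-type flatness of the samples in B(center, radius).

    Returns the root-mean-square distance to the best-fitting d-plane,
    normalised by the radius, or None if the ball holds too few samples.
    """
    in_ball = points[np.linalg.norm(points - center, axis=1) <= radius]
    if len(in_ball) <= d + 1:
        return None
    centered = in_ball - in_ball.mean(axis=0)
    cov = centered.T @ centered / len(in_ball)
    eigvals = np.linalg.eigvalsh(cov)  # eigenvalues in ascending order
    n = points.shape[1]
    # The n - d smallest eigenvalues add up to the mean squared distance
    # from the samples to the best-fitting d-plane (the PCA plane).
    residual = max(float(eigvals[: n - d].sum()), 0.0)
    return float(np.sqrt(residual) / radius)


# Toy data in R^3: a straight segment and an L-shaped corner (both hypothetical).
rng = np.random.default_rng(0)
t = rng.uniform(-1.0, 1.0, 500)
zeros = np.zeros_like(t)
segment = np.stack([t, zeros, zeros], axis=1)
corner = np.stack([np.maximum(t, 0.0), np.maximum(-t, 0.0), zeros], axis=1)

origin = np.zeros(3)
beta_segment = beta2(segment, origin, 0.5, d=1)
beta_corner = beta2(corner, origin, 0.5, d=1)
print("segment:", beta_segment, "corner:", beta_corner)
```

On the segment the value is essentially zero, while at the corner it is visibly positive, which matches the idea of quantifying corners scale by scale. A full test would aggregate such quantities over many balls and scales; that aggregation is not shown here.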
Theme B. Quantitative differentiability in the geometric setting.
To carry out the supervised learning part, we had to develop the theory of quantitative differentiability on QR sets. This essentially consisted in tackling three problems.
3. Prove a Dorronsoro theorem on sets supporting a Poincaré inequality (PI).
4. Prove that any QR set may be embedded into a set supporting a PI.
5. Prove the WALA conjecture, which corresponds to a converse to points 3 and 4.
Let us remark that, really, only points 3 and 4 are directly relevant to learning. Point 5 has, however, very high theoretical importance, as we will explain.
To interpolate outputs with Sobolev functions along a QR set, we need to define function spaces measuring smoothness which make sense in this setting. To this end, two key difficulties arise:
a) We need to define a Sobolev norm which may be computed on the discrete outputs. The point will then be to control the norm of the interpolant by the norm of the outputs (to avoid overfitting).
b) We need this norm to define a Sobolev space on d-dimensional QR sets in Euclidean space of dimension n, or even in Hilbert space.
Dorronsoro's theorem is a quantification of Rademacher's theorem. While the latter shows that any Lipschitz function is close to being affine at (arbitrarily) small scales, the former quantifies the scales where affine approximation fails. Dorronsoro's theorem, importantly, holds for Sobolev functions in $W^{1,p}$, for 1
Such a quantification is done via a square function which encapsulates information from all scales and locations: on each ball B(x,r) centered on the set S, it computes the average of |f-A|/r with respect to the d-dimensional Hausdorff measure on S, where A is an affine function approximating f on the ball; then, it sums these averages over all balls B(x,r). For a Sobolev function f, denote this square function by G(f). As it turns out, the Lp norm of the gradient of f (the homogeneous Sobolev norm) is comparable to the Lp norm of G(f).
Thus, in place of computing the gradient of f, it suffices to compute G(f). This is precisely what we need: since we are dealing with (discrete) samples, we need a way to compute a Sobolev norm at coarse scales, up to the resolution of the data. G(f) may be computed in this case. This solves issue a) mentioned above.
There remains issue b), however: Dorronsoro's theorem holds in R^n. As issue b) states, we need a version that holds on d-dimensional QR sets. This is the second problem which we successfully tackled during the project. More precisely, we showed that:
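A discrete version of G(f) at a single sample point can be sketched as follows. This is only an illustration under stated assumptions: the Hausdorff-measure average is replaced by the empirical average over the samples in the ball, the affine map A is chosen by least squares rather than as a true minimiser, the scales are dyadic between hypothetical bounds `r_min` and `r_max`, and the averages are squared before summing, as the name "square function" suggests. None of the function names come from the project.

```python
import numpy as np


def affine_oscillation(x, f, center, radius):
    """Empirical average of |f - A| / radius over the samples in B(center, radius).

    A is the least-squares affine fit to f on those samples (an assumption:
    a proxy for the best affine approximation). Returns 0.0 when the ball
    holds too few samples to fit an affine map meaningfully.
    """
    mask = np.linalg.norm(x - center, axis=1) <= radius
    pts, vals = x[mask], f[mask]
    if len(pts) < x.shape[1] + 2:
        return 0.0
    design = np.hstack([pts - center, np.ones((len(pts), 1))])
    coef, *_ = np.linalg.lstsq(design, vals, rcond=None)
    return float(np.mean(np.abs(vals - design @ coef)) / radius)


def square_function(x, f, center, r_min, r_max):
    """Square root of the sum of squared oscillations over dyadic scales.

    Scales run through r_max, r_max / 2, ... down to r_min, i.e. only down
    to the resolution of the data.
    """
    total, r = 0.0, r_max
    while r >= r_min:
        total += affine_oscillation(x, f, center, r) ** 2
        r /= 2.0
    return float(np.sqrt(total))


# Toy data: samples on a segment in R^2 with an affine and a kinked output.
t = np.linspace(-1.0, 1.0, 401)
x = np.stack([t, np.zeros_like(t)], axis=1)
affine_output = 2.0 * t + 1.0
kinked_output = np.abs(t)

center = np.zeros(2)
g_affine = square_function(x, affine_output, center, r_min=0.25, r_max=1.0)
g_kinked = square_function(x, kinked_output, center, r_min=0.25, r_max=1.0)
print("affine:", g_affine, "kinked:", g_kinked)
```

The affine output gives a value that is zero up to rounding, while the kink at the origin is detected at every scale, so the quantity stays bounded away from zero there.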
3. A form of Dorronsoro's theorem holds on Ahlfors d-regular subsets of the n-dimensional Euclidean space which support a PI.
4. Any Ahlfors d-regular QR set in the n-dimensional Euclidean space may be contained in a surface which supports a PI.
Results 3 and 4 together essentially solve issue b). Thus, the solutions of a) and b) together let us define a norm which characterises Sobolev spaces on Ahlfors d-regular QR sets and which, moreover, is computable at coarse scales.
Finally, during the project we completely solved the so-called WALA conjecture, which essentially asks for a converse to 3) and 4): if a form of Dorronsoro's theorem holds on an Ahlfors d-regular set, then such a set must be QR.
To summarise: results 1 - 4 are a first step toward introducing quantitative and multiscale techniques in the realm of statistical learning. This, we believe, opens up a whole new multidisciplinary research direction. The proof of the WALA conjecture (5) is a major theoretical achievement. This conjecture was posed at the beginning of the 1990s. Its solution moreover opens the way to developing a satisfactory theory of quantitative differentiability in Euclidean and metric spaces, with an eye on their geometric structure.
Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)
1. Unsupervised and supervised learning. This work has a potentially transformative impact. At present, there is essentially no work on geometric objects more general than manifolds, while there is strong empirical evidence that many datasets lie on e.g. unions of manifolds (of varying dimension), filaments, etc. Thus, this is a first step in making a connection between statistical learning and geometric measure theory.
1.1. Outlook: there are several avenues for future research. a) Currently, all non-linear dimensionality reduction algorithms assume the presence of an underlying manifold. We would like to study whether these algorithms may be run under the weaker hypothesis of a QR set.
b) Our current results have geometric meaning if the underlying measure $\mu$ is assumed to satisfy some density hypothesis. This is somewhat restrictive, and we would like to improve our theory to handle completely unknown probability distributions.
c) The work on supervised learning provides, to our knowledge, the very first test that is able to tell whether we may construct a model in the Sobolev class. For the moment we are able to deal with the $W^{1,p}$ class. We would like to extend the range of smoothness.
d) This theory is presented in the setting of Euclidean distance. Many data sets, however, may have an intrinsic distance which is severely distorted by the embedding. We plan to extend our results to the metric space context.
2. Theoretical questions. The work on Dorronsoro's theorem on UR sets and the work concerning PI and UR have already found applications, precisely in the questions concerning learning.
2.1. In the solution of the WALA conjecture we have developed, for the first time, quantitative techniques concerning the differentiability of Lipschitz functions and the geometry of measures. We expect this to resonate in the community. As a first expected application, we believe we will be able to solve the so-called Many Segment Property conjecture. If true, this would in fact provide an easy-to-compute tool to understand quantitatively notions such as Alberti representations.