CORDIS - EU research results

Measuring with no tape

Periodic Reporting for period 4 - NoTape (Measuring with no tape)

Reporting period: 2022-06-01 to 2023-05-31

The NoTape project hypothesizes that machine learning can be more efficient and easier to interpret if the underlying distance measure used for comparing data is allowed to locally adapt. For example, if the data consist of observations of human body shape, the NoTape position is that it may be beneficial to use a different distance measure when comparing thin people than when comparing heavy ones. This may sound like a predominantly theoretical exercise, but the view has direct benefits for both society and applied branches of science.

One of the key tasks in machine learning is to learn representations of observational data that are suitable for a given task. For instance, if the machine learning system should learn to differentiate between ill and healthy patients, it may seek a representation in which these groups are well separated. For knowledge discovery tasks it is, however, less clear what constitutes a good representation. For instance, when analyzing biological data to design new drugs, we often do not know precisely what we are looking for, and therefore resort to learning compact representations of data in the hope that this will discard insignificant parts of the data. With modern techniques based on neural networks and deep learning, this approach has been applied across disciplines, but at a cost. When compressing data into a compact representation we often observe large distortions of the data: we see groupings that are purely compression artifacts, and we see that nearly identical data become dissimilar in the compressed representation. To make matters worse, re-running an algorithm on the same data may recover significantly different representations. This can lead to misinterpretations of the data and to the formulation of incorrect scientific hypotheses.

One of the key contributions of the NoTape project is a mathematical solution to this problem that can be easily incorporated into existing models. By allowing the distance measure of the learned representation to locally adapt, it can be designed to compensate for compression artifacts in the representation. Statistically, this can be seen as a partial solution to the decades-old "identifiability problem" that has plagued latent variable models. We have shown that under this approach, distances between compressed representations become identical across algorithmic runs and retain the key information of the observed data. In this view, we avoid drawing conclusions from the new representation that are not grounded in the data, thereby limiting the risk of misinterpretation.
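The identifiability point can be illustrated with a minimal numerical sketch (purely illustrative, not the project's actual models): a linear "decoder" has a built-in ambiguity, since the latent coordinates can be reparameterized by any invertible matrix without changing the modeled data. Euclidean distances between latents then differ across equivalent runs, whereas distances under the pull-back metric M = WᵀW, which measure along the decoded data, do not.

```python
import numpy as np

rng = np.random.default_rng(0)

# A linear "decoder" f(z) = W @ z mapping 2-D latents to 5-D observations.
W = rng.standard_normal((5, 2))
z1, z2 = rng.standard_normal(2), rng.standard_normal(2)

# A second training run may recover an equivalent model: latents A @ z with
# decoder W @ inv(A), for any invertible A. Same data, different coordinates.
A = np.array([[2.0, 1.0], [0.5, 1.5]])
W2 = W @ np.linalg.inv(A)
z1b, z2b = A @ z1, A @ z2

# Euclidean latent distances disagree between the two runs ...
d_euc_run1 = np.linalg.norm(z1 - z2)
d_euc_run2 = np.linalg.norm(z1b - z2b)

# ... but distances under the pull-back metric M = W^T W coincide, because
# (z1-z2)^T W^T W (z1-z2) equals the squared distance between decoded points.
def pullback_dist(W, za, zb):
    diff = za - zb
    return np.sqrt(diff @ (W.T @ W) @ diff)

d_pb_run1 = pullback_dist(W, z1, z2)
d_pb_run2 = pullback_dist(W2, z1b, z2b)

print(abs(d_euc_run1 - d_euc_run2) > 1e-6)  # latent coordinates: not identifiable
print(np.isclose(d_pb_run1, d_pb_run2))     # pull-back distances: identifiable
```

The same invariance argument carries over to nonlinear decoders, where the metric is built from the decoder's Jacobian rather than a fixed matrix.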

One of the biggest risks with automated decision-making by artificially intelligent agents is that they may misinterpret the data on which they are trained. By systematically removing compression artifacts from data in learned representations, we drastically limit this risk.

Mathematically, NoTape is concerned with the study of random Riemannian metrics, that is, distance measures that not only change throughout space but also carry a stochastic aspect. We may think of the data as drawn on a rubber sheet that stretches and wobbles as we try to measure which observations are similar. The NoTape project has developed elementary foundations for this largely unstudied mathematical topic.
The project has focused on establishing the mathematical and algorithmic foundations for working with random Riemannian metrics. Elementary mathematical results are in place, alongside well-behaved approximation theorems that are suitable for numerical implementation.
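To make the notion concrete, the following toy sketch (an assumption-laden illustration, not the project's construction) draws a random Riemannian metric on the plane and measures a discretized curve under it. Because the metric is random, the length of a fixed curve becomes a random variable, which is the basic object such a theory must handle.

```python
import numpy as np

rng = np.random.default_rng(1)

# A toy random Riemannian metric on R^2: a deterministic, location-dependent
# base metric times a random positive scale, so every draw is positive definite.
def sample_metric(x, rng):
    base = np.eye(2) * (1.0 + x[0] ** 2)        # deterministic part
    scale = rng.lognormal(mean=0.0, sigma=0.3)  # stochastic, strictly positive
    return scale * base

# Discretized length of a curve under one draw of the metric:
# L = sum_i sqrt(dx_i^T M(x_i) dx_i)
def curve_length(points, rng):
    length = 0.0
    for a, b in zip(points[:-1], points[1:]):
        M = sample_metric(0.5 * (a + b), rng)  # metric sampled at segment midpoint
        d = b - a
        length += np.sqrt(d @ M @ d)
    return length

# Measure a straight-line curve under many metric draws: its length is now a
# random variable with a mean and a spread.
t = np.linspace(0.0, 1.0, 50)
curve = np.stack([t, t], axis=1)
lengths = np.array([curve_length(curve, rng) for _ in range(2000)])
print(lengths.mean(), lengths.std())
```

Well-behaved approximation theorems, of the kind the project establishes, are what justify replacing the continuous curve by such a discretization in numerical implementations.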

With these developments in place, we have applied the theory in practical machine learning applications. We have shown how deep generative models can learn compressed data representations that yield statistically identifiable distances, by endowing the representation with a random Riemannian pull-back metric. We have also shown that if the randomness of the metric is disregarded, the resulting interpretations will be systematically misleading. This statement should not be surprising, as it merely says that for a model to reflect the data it must also be able to reflect on its own uncertainty. It is, however, surprising how clearly this conclusion emerges from our geometric interpretation of generative models.
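For a nonlinear decoder f, the pull-back metric at a latent point z is M(z) = J(z)ᵀJ(z), where J is the decoder's Jacobian. The sketch below computes it with a finite-difference Jacobian for a hypothetical stand-in decoder (not one of the project's trained models); in practice the Jacobian would come from automatic differentiation.

```python
import numpy as np

# A stand-in nonlinear "decoder" f: R^2 -> R^3.
def decoder(z):
    return np.array([np.sin(z[0]), np.cos(z[1]), z[0] * z[1]])

# Central finite-difference Jacobian of f at z.
def jacobian(f, z, eps=1e-6):
    z = np.asarray(z, dtype=float)
    cols = []
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        cols.append((f(z + dz) - f(z - dz)) / (2 * eps))
    return np.stack(cols, axis=1)

# Pull-back metric M(z) = J(z)^T J(z): it measures a latent displacement by
# how far it moves the decoded output.
def pullback_metric(f, z):
    J = jacobian(f, z)
    return J.T @ J

M = pullback_metric(decoder, np.array([0.3, 0.7]))
print(np.allclose(M, M.T))                        # symmetric
print((np.linalg.eigvalsh(M) >= -1e-9).all())     # positive semi-definite
```

In the random-metric setting studied by the project, the decoder itself carries uncertainty, so J, and hence M, becomes a random object rather than the fixed matrix computed here.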

We have further shown how random Riemannian metrics lead to more informative statistical analyses in learned representations. This implies that machine learning methods can become more capable and expressive without requiring further data or becoming more complex. By merely leveraging the geometric information hidden within existing models, we can make models not only better but also more interpretable.
The NoTape project brought forward the idea of random Riemannian metrics in machine learning. This is a new view that gives geometry a more prominent role. It has allowed us to develop models that surpass the state of the art in expressiveness without increasing model complexity, and it therefore also limits the ever-present risk of overfitting, as we have demonstrated repeatedly.

The developed random geometries have been shown to have exciting applications. We have, for example, shown that representations of proteins learned in the popular variational autoencoding framework contain rich biological information that is only revealed when the representation is viewed geometrically. In particular, information regarding protein evolution becomes accessible when the learned representation is endowed with its natural geometric structure (Nature Communications). Such findings are of paramount importance in knowledge discovery using machine learning. We have also shown that learned random Riemannian geometric representations are excellent building blocks for robot control systems: the associated geodesics work very well as motion skills and can easily be adapted, e.g. to avoid dynamic obstacles (R:SS Best Student Paper).

To make the developed theory practically useful, we have released the key building blocks as open-source software in the StochMan Python library.
A learned representation of the beta-lactamase protein family alongside Riemannian shortest paths.