Periodic Reporting for period 3 - NoTape (Measuring with no tape)
Reporting period: 2020-12-01 to 2022-05-31
One of the key tasks in machine learning is to learn representations of observational data that are suitable for a given task. For instance, if the machine learning system should learn to differentiate between ill and well patients, it may seek a representation in which these groups are well separated. For knowledge discovery tasks it is, however, less clear what constitutes a good representation. For instance, when analyzing biological data to design new drugs, we often do not know precisely what we are looking for, and therefore resort to learning compact representations of data in the hope that this will discard insignificant parts of the data. With modern techniques based on neural networks and deep learning, this approach has been applied across disciplines, but at a cost. When compressing data into a compact representation we often observe large distortions in our data; we see groupings of data that are purely compression artifacts, and we see that almost identical data become dissimilar in the compressed representation. To make matters worse, we often observe that if we re-run an algorithm on the same data it may recover significantly different representations. This can lead to misinterpretations of data and to the formulation of incorrect scientific hypotheses.
One of the key contributions of the NoTape project is a mathematical solution to this problem that can easily be incorporated into existing models. By allowing the distance measure of the learned representation to adapt locally, it can be designed to compensate for compression artifacts in the representation. Statistically, this can be seen as a solution to the decades-old "identifiability problem" that has plagued latent variable models. We have shown that under this approach compressed representations become practically identical across algorithmic runs, and retain the key information of the observed data. In this view, we avoid drawing conclusions in the new representation that are not grounded in the data, thereby limiting the risk of misinterpreting the data.
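As a rough illustration of what a locally adapting distance measure can look like, consider a decoder f that maps a compressed representation z back to data space; distances in the representation can then be measured under the pull-back metric induced by f rather than with a fixed Euclidean ruler. The notation below is generic and only meant as a sketch, not the project's own formulation.

```latex
% Sketch: a locally adapting (pull-back) metric induced by a decoder f.
% G(z) varies with the position z, so the length of a curve c in the
% representation reflects distances in data space rather than raw coordinates.
ds^2 = dz^\top G(z)\, dz,
\qquad
G(z) = J_f(z)^\top J_f(z),
\qquad
\mathrm{Length}(c) = \int_0^1 \sqrt{\dot{c}(t)^\top G(c(t))\, \dot{c}(t)}\; dt .
```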
One of the biggest risks with automated decision making by artificially intelligent agents is that they may misinterpret the data on which they are trained. By systematically removing compression artifacts from learned data representations, we drastically limit this risk.
Mathematically, NoTape is concerned with the study of random Riemannian metrics; that is, distance measures that not only change throughout space but also come with a stochastic aspect. We may think of this as data drawn on a rubber sheet that stretches and wobbles as we try to measure which observations are similar. The NoTape project has developed elementary foundations for this largely unstudied mathematical topic.
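A minimal example of such a random metric arises when the decoder itself is stochastic, for instance a Gaussian decoder with mean μ(z) and standard deviation σ(z); the pull-back metric then becomes a random object, and its expectation has a simple closed form. This particular construction is only one instance of the general theory and is stated here under the assumption of a Gaussian decoder with independent output noise.

```latex
% A random pull-back metric arising from a stochastic (Gaussian) decoder
%   f(z) = \mu(z) + \mathrm{diag}(\sigma(z))\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I),
% whose Jacobian J_f = J_\mu + \mathrm{diag}(\epsilon)\, J_\sigma is itself random.
G(z) = J_f(z)^\top J_f(z),
\qquad
\mathbb{E}\big[ G(z) \big] = J_\mu(z)^\top J_\mu(z) + J_\sigma(z)^\top J_\sigma(z).
```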
With these developments in place, we have been able to apply the theory in practical machine learning applications. We have shown how deep generative models can learn compressed data representations that are statistically identifiable by endowing the representation with a random Riemannian pull-back metric. We have shown that if the randomness of the metric is disregarded, then the resulting interpretations will be systematically misleading. This statement should not be surprising, as it merely says that for a model to reflect the data it must also be able to self-reflect on its own uncertainty. It is, however, surprising how clearly this conclusion emerges from our geometric interpretation of generative models.
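To make the idea concrete, the sketch below computes the expected pull-back metric of a toy Gaussian decoder and uses it to measure the length of a curve in the learned representation. The decoder, its weights, and all function names are illustrative placeholders written in JAX; this is not the project's released software.

```python
import jax
import jax.numpy as jnp

def decoder_mean(z):
    # Illustrative stand-in for a trained decoder's mean function mu(z).
    W1 = jnp.ones((32, 2)) * 0.1
    W2 = jnp.ones((5, 32)) * 0.1
    return jnp.tanh(W2 @ jnp.tanh(W1 @ z))

def decoder_std(z):
    # Illustrative stand-in for a trained decoder's standard deviation sigma(z).
    return 0.1 + jnp.abs(jnp.sin(z)).mean() * jnp.ones(5)

def expected_pullback_metric(z):
    # Expected Riemannian metric of a Gaussian decoder (assumed form):
    # E[G(z)] = J_mu(z)^T J_mu(z) + J_sigma(z)^T J_sigma(z).
    J_mu = jax.jacfwd(decoder_mean)(z)      # shape (data_dim, latent_dim)
    J_sigma = jax.jacfwd(decoder_std)(z)    # shape (data_dim, latent_dim)
    return J_mu.T @ J_mu + J_sigma.T @ J_sigma

def curve_length(curve, metric_fn, num_points=100):
    # Approximate Riemannian length of a latent curve under metric_fn.
    t = jnp.linspace(0.0, 1.0, num_points)
    points = jax.vmap(curve)(t)
    velocities = jax.vmap(jax.jacfwd(curve))(t)
    speeds = jax.vmap(lambda p, v: jnp.sqrt(v @ metric_fn(p) @ v))(points, velocities)
    # Trapezoidal integration of the speed along the curve.
    return jnp.sum(0.5 * (speeds[1:] + speeds[:-1]) * (t[1:] - t[:-1]))

# Example: length of a straight line between two latent codes.
z0, z1 = jnp.array([0.0, 0.0]), jnp.array([1.0, 1.0])
line = lambda t: (1.0 - t) * z0 + t * z1
print(curve_length(line, expected_pullback_metric))
```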
We have further shown how random Riemannian metrics lead to more informative statistical analyses in learned representations. This implies that machine learning methods can become more capable and expressive without requiring further data or becoming more complex. By merely leveraging the geometric information hidden within existing models, we can make models not only better but also more interpretable.
It is expected that the second half of the project will allow us to expand into different branches of representation learning, and will allow us to demonstrate how solving the identifiability problem directly benefits disentanglement of data, estimation of causal factors, and reliable visualization. We further plan to expand our efforts to bring our tools to the applied sciences, providing more robust tools for knowledge discovery.
We are working to release software that will allow the developed tools to be more widely applied.