
The missing mathematical story of Bayesian uncertainty quantification for big data

Periodic Reporting for period 1 - BigBayesUQ (The missing mathematical story of Bayesian uncertainty quantification for big data)

Reporting period: 2022-08-01 to 2025-01-31

Recent years have seen a rapid increase in available information. This has created an urgent need for fast statistical and machine learning methods that can scale up to big data sets. Standard approaches, including the now routinely used Bayesian methods, are becoming computationally infeasible, especially in complex models with many parameters and large data sizes. A variety of algorithms have been proposed to speed up these procedures, but they are typically black-box methods with very limited theoretical support. In fact, empirical evidence shows that such methods can perform poorly. This is especially concerning in real-world applications, e.g. in medicine.

In this project I shall open up the black box and provide a theory for scalable Bayesian methods, combining recent, state-of-the-art techniques from Bayesian nonparametrics, empirical process theory and machine learning. I focus on two very important classes of scalable techniques: variational and distributed Bayes. I shall establish guarantees, but also limitations, of these procedures for estimating the parameter of interest, and for quantifying the corresponding uncertainty, within a framework that is also convincing outside the Bayesian paradigm. As a result, scalable Bayesian techniques will achieve more accurate performance and better acceptance by a wider community of scientists and practitioners. The proposed research, although motivated by real-world problems, is of a mathematical nature. In the analysis I consider mathematical models that are routinely used in various fields (e.g. high-dimensional linear and logistic regression are the workhorses of econometrics and genetics). My theoretical results will provide principled new insights that can be used, for instance, in several specific applications I am involved in, including developing novel statistical methods for understanding fundamental questions in cosmology and for the early detection of dementia using multiple data sources.
So far, the project has successfully achieved several key objectives outlined in the research proposal, together with additional related results.

With my co-authors, we have derived theoretical underpinnings for several approximation methods that were previously used as black-box procedures. We have considered the popular Gaussian process method, whose computation becomes slow for large data sets. We have investigated variational Bayes, distributed Bayes, the Vecchia approximation and probabilistic numerics (e.g. conjugate gradient descent and Lanczos iteration), and have derived guarantees, but also limitations, for these techniques. Based on our theoretical results, we have provided a guideline for practitioners on how to tune their procedures to achieve optimal accuracy. Besides the theoretical results, we have also implemented these methods following the developed guidelines. The corresponding code is available on GitHub.
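To convey how such approximations trade exactness for speed, the sketch below implements a basic inducing-point (variational/Nystrom-type) Gaussian process regression in Python. It is a minimal illustration on toy data, not the code from our repository; the kernel, its length-scale and the number of inducing points are placeholder choices.

    import numpy as np

    def rbf(x, y, ell=0.2):
        # squared-exponential kernel matrix between the points in x and y
        return np.exp(-(x[:, None] - y[None, :])**2 / (2 * ell**2))

    rng = np.random.default_rng(0)
    n, m, sigma = 2000, 50, 0.1            # m inducing points << n data points
    x = np.sort(rng.uniform(0, 1, n))
    y = np.sin(2 * np.pi * x) + sigma * rng.normal(size=n)

    z = np.linspace(0, 1, m)               # inducing inputs on a grid
    Kzz = rbf(z, z) + 1e-8 * np.eye(m)
    Kxz = rbf(x, z)

    # variational posterior mean over the inducing values:
    # O(n m^2) work instead of the O(n^3) exact GP solve
    A = Kzz + Kxz.T @ Kxz / sigma**2
    mu_z = Kzz @ np.linalg.solve(A, Kxz.T @ y) / sigma**2
    f_hat = Kxz @ np.linalg.solve(Kzz, mu_z)   # predictive mean at the data

Our guidelines concern precisely such tuning choices, for instance the rate at which the number of inducing points m has to grow with the sample size n so that the approximate posterior retains optimal estimation and uncertainty-quantification properties.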

We have also developed a novel approach for linearizing non-linear inverse problems. We have already applied this technique to several standard examples, but we plan to further investigate its use for more complex, PDE-governed inverse problems.

In another line of research, we have developed a skewed version of the celebrated Bernstein-von Mises theorem. Our approach yields an order-of-magnitude better approximation of the posterior, and a follow-up result can be used to improve any symmetric approximation by inducing skewness into it. We have provided both theoretical guarantees and empirical evidence, on real-world and synthetic data sets, in support of our approach.
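Schematically, the skewed approximations we study take the skew-symmetric form below; this display is a generic template, with placeholder symbols, rather than the exact statement of our theorem:

    \hat{\pi}_{\mathrm{skew}}(\theta) = 2\, \varphi_d\big(\theta; \hat{\theta}, \hat{\Omega}\big)\, F\big(\alpha(\theta - \hat{\theta})\big),

where \varphi_d(\cdot; \hat{\theta}, \hat{\Omega}) is the d-dimensional Gaussian density centered at a posterior mode \hat{\theta} with covariance \hat{\Omega}, F is the distribution function of a symmetric univariate law, and \alpha is an odd function. Any such choice integrates to one; in our construction the skewing function \alpha is built from third-order derivatives of the log-posterior, capturing the leading asymmetry that the Gaussian Bernstein-von Mises approximation misses.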

Finally, we have derived a testing approach in the distributed setting under communication constraints. We have also established theoretical limits for testing in distributed settings and shown that our approach attains these limits, hence it is optimal.
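To convey the flavor of communication-constrained testing, here is a deliberately simple one-bit protocol in Python; it is a toy illustration on synthetic data, not the optimal procedure from our paper. Each machine transmits only the sign of its local mean, and the center aggregates these bits into a test statistic that is asymptotically standard normal under the null.

    import numpy as np

    rng = np.random.default_rng(1)
    m, n, theta = 100, 50, 0.2          # machines, local sample size, true signal

    # each machine observes n noisy copies of theta and transmits a single
    # bit: the sign of its local mean (an extreme 1-bit communication budget)
    data = theta + rng.normal(size=(m, n))
    bits = np.sign(data.mean(axis=1))

    # under H0: theta = 0 the bits are +/-1 with probability 1/2 each, so
    # the aggregated statistic is asymptotically N(0, 1); test at level 0.05
    T = bits.sum() / np.sqrt(m)
    reject = abs(T) > 1.96

The theoretical limits quantify how much detection power is necessarily lost as the per-machine communication budget shrinks, and our procedures are designed to match those limits.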
Our paper on the novel linearization method for non-linear, PDE-constrained inverse problems can be considered an interesting, novel approach that can open up future research directions. As a first step, we split the non-linear statistical problem into a linear statistical problem and a non-linear analytical problem. The linear statistical problem can be solved substantially faster than its non-linear counterpart. Furthermore, there is a much broader theoretical underpinning and methodological development for linear inverse problems, which can be applied in the non-linear setting following our strategy. For the non-linear analytic problem, one can either compute the explicit solution analytically or approximate it with numerical methods. Combining the two results in a faster approach with strong theoretical guarantees, which we have already applied to a range of PDE-constrained inverse problems. Examples include the time-independent Schrödinger equation (elliptic PDE), the heat equation with an absorption term (parabolic PDE), Darcy's flow problem, etc. We plan to extend the number of examples, including, for instance, the 2D Navier-Stokes equations and the non-Abelian X-ray transform on surfaces. Another line of future work is to combine this approach with variational, distributed and other approximation methods to further speed up the (otherwise very time-consuming) computations.
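As a stylized illustration of the splitting, consider the one-dimensional stationary Schrödinger-type equation u'' = f u with noisy point observations of u. The Python sketch below (toy data, with a simple polynomial sieve standing in for the nonparametric estimators we actually analyze) first solves the linear statistical problem of estimating u and u'', and then the non-linear analytic problem of recovering the potential f pointwise.

    import numpy as np

    # true solution of the stationary Schrodinger-type equation u'' = f * u:
    # pick a smooth positive u and read off the corresponding potential f
    x = np.linspace(0.1, 0.9, 400)
    u = 2 + np.sin(2 * np.pi * x)
    f_true = -(2 * np.pi)**2 * np.sin(2 * np.pi * x) / u   # f = u'' / u

    rng = np.random.default_rng(2)
    y = u + 0.01 * rng.normal(size=x.size)   # noisy observations of u

    # linear statistical step: nonparametric regression (here a polynomial
    # sieve) estimating u and, by analytic differentiation, u''
    coef = np.polyfit(x, y, deg=10)
    u_hat = np.polyval(coef, x)
    u_hat2 = np.polyval(np.polyder(coef, 2), x)

    # non-linear analytic step: recover the potential pointwise
    f_hat = u_hat2 / u_hat

The same two-step pattern carries over to the other PDE examples, with the analytic step replaced by the corresponding explicit formula or a numerical PDE solve.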


In another line of research, we have developed a skew-symmetric version of the Laplace approximation. We have demonstrated both theoretically and numerically that it provides an order-of-magnitude better approximation of the posterior than the standard Gaussian one. This motivated our follow-up work, where we developed a skewness-inducing factor that can be applied to any symmetric approximation. Examples include Gaussian variational Bayes, expectation propagation and the Laplace approximation. We have demonstrated both theoretically and numerically that the proposed approach can indeed substantially improve the approximation of the posterior while maintaining (almost) the same computational cost as the symmetric approximation. In contrast to higher-order approximations (e.g. Edgeworth expansions), our method provides a genuine density that is easy to compute and to sample from. We believe that the present approach can be combined, via a simple modification, with any standard symmetric approximation used in practice, and hence can be implemented in standard software, e.g. INLA or ALA.
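A practical appeal of skew-symmetric approximations is that exact sampling needs only one extra uniform draw per sample: draw from the symmetric base and reflect about the mode with the appropriate probability. The Python sketch below illustrates this; the mode, covariance and skewing function are hypothetical placeholders (in our work the skewing function is derived from third-order log-posterior derivatives).

    import numpy as np
    from scipy.stats import norm

    def sample_skewed(mu, Sigma_chol, skew, size, rng):
        # draw from the density 2 * N(x; mu, Sigma) * Phi(skew(x - mu)):
        # sample the symmetric base, then reflect about mu with the
        # complementary probability (skew must be an odd function)
        z = rng.normal(size=(size, len(mu))) @ Sigma_chol.T
        keep = rng.uniform(size=size) <= norm.cdf(skew(z))
        return mu + np.where(keep[:, None], z, -z)

    rng = np.random.default_rng(3)
    mu = np.array([0.0, 0.0])
    L = np.linalg.cholesky(np.array([[1.0, 0.3], [0.3, 1.0]]))

    # illustrative odd skewing function standing in for the cubic factor
    # built from third-order log-posterior derivatives in our papers
    skew = lambda z: z[:, 0] + 0.5 * z[:, 1]**3
    draws = sample_skewed(mu, L, skew, 10000, rng)

Because sampling stays this cheap, the skewness-inducing factor adds essentially no overhead on top of fitting the underlying symmetric approximation.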