CORDIS - EU research results

Nonparametric Bayes and empirical Bayes for species sampling problems: classical questions, new directions and related issues

Periodic Reporting for period 3 - NBEB-SSP (Nonparametric Bayes and empirical Bayes for species sampling problems: classical questions, new directions and related issues)

Reporting period: 2022-03-01 to 2023-08-31

The objects of research are species sampling problems and their generalizations, whose importance has grown considerably in recent years, driven by numerous applications in the biological and physical sciences as well as in machine learning, theoretical computer science and information theory. Species sampling problems form a broad class of discrete-functional estimation problems, i.e. inferential problems. Within this field, the project focuses on two main research themes:

RT1) the study of nonparametric Bayes and nonparametric empirical Bayes methodologies for classical species sampling problems, for generalized species sampling problems emerging in the biological and physical sciences, and for related questions in the optimal design of species inventories;

RT2) the use of recent mathematical tools from the theory of differential privacy to study the fundamental tradeoff between privacy protection, which requires releasing only partial data, and Bayesian learning in species sampling problems, which requires accurate data for inference.

With regard to RT1, we considered the following research lines (RL):

RL1) the study of the random objects underlying the Bayesian nonparametric approach to species sampling problems, i.e. exchangeable random partitions and their hierarchical (and nested) generalizations, with emphasis on their asymptotic and non-asymptotic properties (central limit theorems, Berry-Esseen bounds, large deviations and concentration inequalities);

RL2) the study of a novel approach to the frequentist validation of Bayesian procedures in species sampling problems, which relies on a dynamic formulation of the Wasserstein distance and establishes connections with classical problems in statistics and probability, e.g. the speed of mean Glivenko-Cantelli convergence, the estimation of weighted Poincaré-Wirtinger constants and Sanov's large deviation principle for the Wasserstein distance;

RL3) the extension of nonparametric Bayes and nonparametric empirical Bayes methodologies for species sampling to the more general setting of feature and trait allocation problems, with applications to the biological sciences, e.g. cancer genomics, microbial ecology and single-cell sequencing, and to optimization in web-content design;

RL4) the development of a nonparametric empirical Bayes methodology for classical species sampling models under the assumption of power-law data, with applications to human activity data, e.g. patterns of website visits, email messages, relations and interactions on social networks, password innovation, tags in annotation systems and edits of webpages;

RL5) the extension of RL1), RL2) and RL3) to the more general context of multiple populations of individuals sharing species, features or traits, i.e. to the setting of partial exchangeability, with applications to precision medicine, microbiome analysis, single-cell sequencing and wildlife monitoring;

RL6) the study of species sampling problems under the constraint that only small summaries of the data are available, through (sufficient) statistics or sketches obtained through hash functions, with connections to compressed sensing and, more generally, to sparse recovery with sparse matrices.
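
As a concrete instance of a classical species sampling problem, consider estimating the probability that the next observation belongs to a species not yet seen in the sample (the missing mass), or equivalently the sample coverage. The sketch below uses the classical Good-Turing estimator, which is commonly read as a nonparametric empirical Bayes estimator; it is included purely for illustration and is not claimed to be the methodology developed in the project. The function names and the toy sample are hypothetical.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the probability that the next
    observation is a species not yet seen in `sample`.

    The estimate is m1 / n, where m1 is the number of species
    observed exactly once and n is the sample size.
    """
    n = len(sample)
    if n == 0:
        raise ValueError("sample must be non-empty")
    frequencies = Counter(sample)  # species label -> observed count
    m1 = sum(1 for count in frequencies.values() if count == 1)
    return m1 / n


def sample_coverage(sample):
    """Estimated total probability of the species already observed."""
    return 1.0 - good_turing_missing_mass(sample)


# Toy sample (hypothetical labels): 7 individuals, 4 distinct species,
# two of which ("finch" and "owl") appear exactly once.
observed = ["wren", "wren", "robin", "finch", "wren", "robin", "owl"]
missing = good_turing_missing_mass(observed)
coverage = sample_coverage(observed)
print(f"estimated missing mass: {missing:.3f}")
print(f"estimated coverage:     {coverage:.3f}")
```

The estimator depends on the data only through the number of singletons and the sample size, which is one reason small summaries of the data can suffice for some species sampling questions.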

With regard to RT2, we considered the following research lines (RL):

RL7) the development of a nonparametric Bayes methodology and a nonparametric empirical Bayes methodology for disclosure risk assessment (or risk of re-identification), which is at the basis of some modern privacy-preserving mechanisms;

RL8) the study, and the development, of species sampling problems within the frameworks of global differential privacy and local differential privacy, with respect to suitable perturbation mechanisms, e.g. Laplace and Gaussian noise addition, generalized randomized response and bit flipping;

RL9) the development of a comprehensive theory for goodness-of-fit tests, with emphasis on the study of the power of the test, under global differential privacy and local differential privacy;

RL10) the study of novel mechanisms for releasing private data, based on the use of synthetic data generated from nonparametric posterior distributions, with applications to species sampling problems;

RL11) the development of a computational framework (Markov chain Monte Carlo) for Bayesian nonparametric estimation and clustering under local differential privacy and global differential privacy.
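
Two of the perturbation mechanisms named above can be sketched in a few lines. The first adds Laplace noise to a table of species counts (global differential privacy); the second applies generalized (k-ary) randomized response to a single individual's species label (local differential privacy). This is a minimal illustration under stated assumptions, not the project's implementation: the function names are hypothetical, and the default sensitivity of 1 assumes that neighbouring datasets differ by adding or removing one individual.

```python
import math
import random


def laplace_noise(scale, rng):
    """Draw Laplace(0, scale) noise as the difference of two
    independent exponential variables with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def laplace_mechanism(counts, epsilon, rng, sensitivity=1.0):
    """Release species counts with Laplace noise of scale
    sensitivity / epsilon added to each count.

    Assumption: L1 sensitivity 1, i.e. neighbouring datasets differ
    by the addition or removal of a single individual.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    scale = sensitivity / epsilon
    return {species: count + laplace_noise(scale, rng)
            for species, count in counts.items()}


def generalized_randomized_response(value, domain, epsilon, rng):
    """Report the true label with probability e^eps / (e^eps + k - 1),
    otherwise one of the other k - 1 labels uniformly at random."""
    k = len(domain)
    if k < 2 or value not in domain:
        raise ValueError("domain needs >= 2 labels and must contain value")
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_truth:
        return value
    return rng.choice([label for label in domain if label != value])


rng = random.Random(0)  # fixed seed only to make the demo repeatable
true_counts = {"wren": 3, "robin": 2, "finch": 1}
noisy_counts = laplace_mechanism(true_counts, epsilon=1.0, rng=rng)
reported = generalized_randomized_response(
    "wren", list(true_counts), epsilon=1.0, rng=rng)
print(noisy_counts, reported)
```

Smaller values of `epsilon` give stronger privacy and noisier releases, which is the tradeoff between privacy protection and accurate inference that RT2 is concerned with.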

In addition to RT1 and RT2, we started a new research theme (RT3) on deep Bayesian neural networks, which are very popular in statistics and machine learning. Several results have been obtained on the large-width and large-depth behaviour of feedforward deep neural networks, also in terms of contraction rates, under both Gaussian and Stable random weights for the network. Quantitative central limit theorems for large-width Gaussian neural networks have also been established by relying on the Stein-Malliavin calculus and second-order Poincaré inequalities. Other results concern the fundamental problem of training feedforward neural networks through gradient descent, leading to an interesting generalization of the popular notion of the neural tangent kernel and establishing a link with estimation in kernel regression.
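
The large-width behaviour under Gaussian weights can be illustrated with a small simulation. The sketch below draws many independent one-hidden-layer ReLU networks with i.i.d. standard Gaussian weights, evaluates each at a fixed input, and compares the empirical variance of the output with the value ||x||^2 / 2 implied by the Gaussian weights. The architecture, the ReLU activation, the scaling by the square root of the width and all names are assumptions chosen for illustration; they are not taken from the project's results.

```python
import math
import random


def relu(z):
    return z if z > 0.0 else 0.0


def network_output(x, width, rng):
    """One draw of f(x) = width**-0.5 * sum_j v_j * relu(<w_j, x>),
    with all weights w_j and v_j i.i.d. standard Gaussian."""
    total = 0.0
    for _ in range(width):
        pre_activation = sum(rng.gauss(0.0, 1.0) * xi for xi in x)
        total += rng.gauss(0.0, 1.0) * relu(pre_activation)
    return total / math.sqrt(width)


def empirical_variance(x, width, n_draws, seed=0):
    """Sample variance of f(x) over independently initialised networks."""
    rng = random.Random(seed)
    draws = [network_output(x, width, rng) for _ in range(n_draws)]
    mean = sum(draws) / n_draws
    return sum((d - mean) ** 2 for d in draws) / (n_draws - 1)


x = [1.0, 1.0]
# <w, x> is N(0, ||x||^2), and E[relu(Z)^2] = s^2 / 2 for Z ~ N(0, s^2).
target_variance = sum(xi * xi for xi in x) / 2.0
estimate = empirical_variance(x, width=100, n_draws=2000)
print(f"empirical variance: {estimate:.3f}, target: {target_variance:.3f}")
```

In this toy model the output variance equals ||x||^2 / 2 at every width; what the large-width limit adds is that the distribution of the output becomes Gaussian, and quantitative central limit theorems bound how fast this happens.
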
The work in the period March 2019 - August 2023, within research themes RT1, RT2 and RT3, produced numerous results that represent progress beyond the state of the art, and also paved the way for new research directions in species sampling problems and their generalizations. The project is therefore in continuous evolution. In the final months of the project, we will keep working on the research lines that we opened within RT1 and RT2. Among these research lines, RL2, RL3, RL5 and RL10 led to novel, promising ideas that deserve particular attention and further investigation. With regard to RL2, we aim to revisit the classical approach to the frequentist validation of Bayesian procedures, in both the parametric and the nonparametric setting, within the mathematical framework of optimal transport that we have developed; we expect this research line to produce interesting results in Bayesian statistics, especially in the context of inverse problems with applications to image reconstruction, and interesting connections with the fields of probability and analysis. With regard to RL3, our results suggest the need to introduce novel classes of nonparametric prior distributions for feature/trait allocation models, which may produce more robust inferences than the competing priors currently known in the literature. RL5 sets forth a novel methodology that gave rise to interesting connections with the following areas: i) sketching algorithms for streaming data, in connection with the problems of frequency recovery and cardinality recovery; ii) compressed sensing; iii) sparse recovery via sparse matrices; iv) multi-armed bandits for "better machine learning". We expect to further develop the proposed methodology to deal with general species sampling problems under sketched data; some results have been obtained in the context of the estimation of coverage probabilities. With regard to RL10, we aim to further investigate the idea of obtaining private data by sampling from nonparametric posterior distributions, as well as the use of such data in Bayesian inference.