CORDIS - EU research results

Nonparametric Bayes and empirical Bayes for species sampling problems: classical questions, new directions and related issues

Periodic Reporting for period 4 - NBEB-SSP (Nonparametric Bayes and empirical Bayes for species sampling problems: classical questions, new directions and related issues)

Reporting period: 2023-09-01 to 2024-02-29

The objects of research are species sampling problems, and generalizations thereof, whose importance has grown considerably in recent years, driven by numerous applications in the biological and physical sciences, as well as in machine learning, theoretical computer science and information theory. Species sampling problems form a broad class of discrete-functional estimation problems, i.e. inferential problems; within this field of research, we focus on two main research themes:

RT1) the study of nonparametric Bayes and nonparametric empirical Bayes methodologies for classical species sampling problems, for generalized species sampling problems emerging in the biological and physical sciences, and for related questions in the context of optimal design of species inventories;

RT2) the use of recent mathematical tools from the theory of differential privacy to study the fundamental tradeoff between privacy protection of information, which requires releasing only partial data, and Bayesian learning in species sampling problems, which requires accurate data for inference.

With regard to RT1, we have developed the following research lines (RL):

RL1) random objects that are at the basis of the Bayesian nonparametric approach to species sampling problems, i.e. exchangeable random partitions and hierarchical (and nested) generalizations thereof, with emphasis on their asymptotic and non-asymptotic properties (central limit theorems, Berry-Esseen bounds, large deviations and concentration inequalities);

RL2) a novel approach to frequentist validation of Bayesian procedures in species sampling problems, which relies on a dynamic formulation of the Wasserstein distance and establishes surprising connections with classical problems in mathematical statistics and probability, e.g. the speed of mean Glivenko-Cantelli convergence, the estimation of weighted Poincaré-Wirtinger constants and the Sanov large deviation principle for the Wasserstein distance;

RL3) the extension of nonparametric Bayes and empirical Bayes methodologies for species sampling problems to the more general setting of feature and trait allocation problems, with applications to the biological sciences, e.g. cancer genomics, microbial ecology and single-cell sequencing, and to optimization in web-content design;

RL4) a nonparametric empirical Bayes methodology for classical species sampling models under the assumption of power-law data, with applications to human activity data, e.g. patterns of website visits, email messages, relations and interactions on social networks, password innovation, tags in annotation systems and edits of webpages;

RL5) the extension of RL1), RL2) and RL3) to the more general context of multiple populations of individuals sharing species, features or traits, i.e. the setting of partial exchangeability, with applications to precision medicine, microbiome analysis, single-cell sequencing and wildlife monitoring;

RL6) species sampling problems under the constraint that only small summaries of the data are available, through (sufficient) statistics or sketches obtained through hash functions, with connections to compressed sensing and, more generally, to sparse recovery with sparse matrices.
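As a small illustration of the exchangeable random partitions mentioned in RL1, the sketch below samples a partition of n individuals from the two-parameter (Pitman-Yor) sequential scheme, a standard example in Bayesian nonparametrics. It is not taken from the project's own code: the function name `sample_partition` and the parameter names `sigma` and `theta` are assumptions chosen for this example, and the parameter range is restricted to 0 <= sigma < 1 and theta > -sigma.

```python
import random


def sample_partition(n, sigma, theta, rng):
    """Sample block sizes of a random partition of n individuals.

    Illustrative sketch of the two-parameter (Pitman-Yor) sequential
    scheme; names and interface are hypothetical, not the project's.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    if not (0 <= sigma < 1) or theta <= -sigma:
        raise ValueError("need 0 <= sigma < 1 and theta > -sigma")

    block_sizes = [1]  # the first individual always starts a new block
    for seated in range(1, n):
        k = len(block_sizes)
        # Probability that the next individual belongs to a new species.
        p_new = (theta + k * sigma) / (theta + seated)
        if rng.random() < p_new:
            block_sizes.append(1)
        else:
            # Otherwise join block j with probability proportional
            # to (size_j - sigma).
            weights = [size - sigma for size in block_sizes]
            j = rng.choices(range(k), weights=weights)[0]
            block_sizes[j] += 1
    return block_sizes


if __name__ == "__main__":
    sizes = sample_partition(1000, sigma=0.5, theta=1.0, rng=random.Random(0))
    print("distinct species:", len(sizes))
```

In this sketch the number of blocks plays the role of the number of distinct species observed in a sample of size n; larger values of `sigma` tend to produce more blocks with heavier-tailed sizes.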

With regard to RT2, we have developed the following research lines (RL):

RL7) a nonparametric Bayes methodology and a nonparametric empirical Bayes methodology for disclosure risk assessment (also known as the risk of re-identification), which is at the basis of some modern privacy-preserving mechanisms;

RL8) species sampling problems within the framework of global differential privacy and the framework of local differential privacy, with respect to suitable perturbation mechanisms, e.g. Laplace and Gaussian noise addition, general exponential mechanisms, generalized randomized response and bit flipping;

RL9) a comprehensive theory for goodness-of-fit tests, with emphasis on the study of the power of the test, under global differential privacy and local differential privacy;

RL10) a novel randomized mechanism for releasing private data, which is based on the use of synthetic data generated from nonparametric posterior distributions, with applications to species sampling problems;

RL11) a computational framework (based on Markov chain Monte Carlo) for Bayesian nonparametric estimation and clustering under local differential privacy and global differential privacy.
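To make one of the perturbation mechanisms named in RL8 concrete, the following is a minimal sketch of generalized (k-ary) randomized response under local differential privacy, together with the usual debiased frequency estimate. It is a textbook construction shown under stated assumptions, not the project's methodology: the function names `randomize` and `estimate_frequencies`, and the requirements that the domain has at least two values and that `epsilon` is positive, are choices made for this example.

```python
import math
import random
from collections import Counter


def randomize(value, domain, epsilon, rng):
    """Report the true value with probability e^eps / (e^eps + k - 1),
    otherwise report one of the other k - 1 values uniformly at random."""
    k = len(domain)
    p_truth = math.exp(epsilon) / (math.exp(epsilon) + k - 1)
    if rng.random() < p_truth:
        return value
    others = [v for v in domain if v != value]
    return rng.choice(others)


def estimate_frequencies(reports, domain, epsilon):
    """Debias observed report proportions into estimates of the true
    frequencies (estimates may fall slightly outside [0, 1])."""
    k = len(domain)
    n = len(reports)
    e = math.exp(epsilon)
    p = e / (e + k - 1)   # probability of reporting the true value
    q = 1 / (e + k - 1)   # probability of reporting a given other value
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}


if __name__ == "__main__":
    rng = random.Random(0)
    domain = ["a", "b", "c"]
    data = ["a"] * 6000 + ["b"] * 3000 + ["c"] * 1000
    reports = [randomize(v, domain, 1.0, rng) for v in data]
    print(estimate_frequencies(reports, domain, 1.0))
```

The tradeoff described in RT2 is visible here: a smaller `epsilon` gives stronger privacy but shrinks `p - q`, which inflates the variance of the debiased estimates.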

Under the research themes RT1 and RT2, our work has yielded numerous results, leading to several publications in top-tier journals spanning (mathematical) statistics, applied probability, and machine learning. These findings have also been presented by the PI and team members at leading conferences and workshops in the field.

In addition to the research themes RT1 and RT2, we started a new research theme on the theory of deep Bayesian neural networks, which are nowadays very popular in the fields of statistics and machine learning. Several results have been produced on the large-width and large-depth behaviour of feedforward deep neural networks, also in terms of contraction rates, under both Gaussian random weights and Stable (or heavy-tailed) random weights for the network. Quantitative central limit theorems for large-width Gaussian neural networks have also been considered, relying on the Stein-Malliavin calculus and second-order Poincaré inequalities. Other results concern the fundamental problem of training feedforward neural networks through gradient descent, leading to an interesting generalization of the popular notion of the neural tangent kernel and establishing a link with estimation in kernel regression.

Our work has produced numerous results that represent progress beyond the state of the art. Among them, the major achievements are the following: we completed the ambitious project of providing a comprehensive and up-to-date overview of Bayesian nonparametric inference for species sampling problems, which was missing in the literature; we investigated the limitations of completely random measures in Bayesian nonparametric estimation of the number of unseen features, and proposed the stable-Beta scaled process prior to address these shortcomings; we developed a nonparametric empirical Bayes approach to estimate the unseen under the assumption of power-law data, obtaining a consistent estimator for the number of unseen species and investigating minimax lower bounds; we considered the estimation of the unknown frequency of a symbol in a stream of tokens, under the assumption that the data are available only through sketches, i.e. lossy-compressed versions of the data obtained through random hashing; we developed a new approach to quantify posterior contraction rates (PCRs) in Bayesian statistics, first considering the case of nondominated Bayesian models.
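For readers unfamiliar with sketches obtained through random hashing, the following is a minimal count-min sketch, a generic and widely used data structure for approximate token frequencies in a stream. It only illustrates the kind of lossy summary referred to above and is not the estimator developed in the project; the class name `CountMinSketch`, the `width` and `depth` parameters, and the use of salted BLAKE2b as the hash family are assumptions made for this example.

```python
import hashlib


class CountMinSketch:
    """Illustrative count-min sketch: depth rows of width counters."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, row, token):
        # One hash function per row, obtained by salting BLAKE2b
        # with the row index.
        digest = hashlib.blake2b(
            token.encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def update(self, token, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(row, token)] += count

    def query(self, token):
        # Hash collisions can only inflate a counter, so the minimum
        # over rows never underestimates the true frequency.
        return min(
            self.table[row][self._bucket(row, token)]
            for row in range(self.depth)
        )


if __name__ == "__main__":
    sketch = CountMinSketch(width=64, depth=4)
    for token in ["x", "y", "x", "z", "x"]:
        sketch.update(token)
    print(sketch.query("x"))  # an upper bound on the true count of "x"
```

The sketch stores only `width * depth` counters regardless of the stream length, which is the sense in which the original data are available only in compressed form.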