Nonparametric Bayes and empirical Bayes for species sampling problems: classical questions, new directions and related issues

Periodic Reporting for period 2 - NBEB-SSP (Nonparametric Bayes and empirical Bayes for species sampling problems: classical questions, new directions and related issues)

Reporting period: 2020-09-01 to 2022-02-28

The objects of research are species sampling problems, and generalizations thereof, whose importance has grown considerably in recent years, driven by numerous applications in the biological and physical sciences, as well as in machine learning, theoretical computer science and information theory. Species sampling problems form a broad class of discrete-functional estimation problems, i.e. inferential problems. Within this field of research, we focus on two main research themes:

RT1) the study of nonparametric Bayes and nonparametric empirical Bayes methodologies for classical species sampling problems, for generalized species sampling problems emerging in the biological and physical sciences, and for related questions in the context of the optimal design of species inventories;

RT2) the use of recent mathematical tools from the theory of differential privacy to study the fundamental tradeoff between privacy protection of information, which requires releasing only partial data, and Bayesian learning in species sampling problems, which requires accurate data to make inferences.

With regard to RT1, we started the following research lines (RL):

RL1) the study of the random objects that are at the basis of the Bayesian nonparametric approach to species sampling problems, i.e. exchangeable random partition structures and hierarchical and nested generalizations thereof, with emphasis on large sample asymptotic properties;

RL2) the study of a novel approach to the frequentist validation of Bayesian procedures in species sampling problems, which relies on a dynamic formulation of the Wasserstein distance and establishes numerous connections with classical problems in statistics and probability theory, e.g. the speed of mean Glivenko-Cantelli convergence, the estimation of weighted Poincaré-Wirtinger constants and the Sanov large deviation principle for the Wasserstein distance;

RL3) the extension of nonparametric Bayes and nonparametric empirical Bayes methodologies for species sampling to the more general setting of feature sampling problems and trait allocation problems, with applications to the biological sciences, e.g. cancer genomics, microbial ecology and single-cell sequencing;

RL4) the development of a nonparametric empirical Bayes methodology for classical species sampling models under the assumption of power-law data, with applications to human activity data, e.g. patterns of website visits, email messages, relations and interactions on social networks, password innovation, tags in annotation systems and edits of webpages;

RL5) the extension of RL1, RL2 and RL3 to the more general context of multiple populations of individuals sharing species, features or traits, i.e. to the setting of partial exchangeability, with applications to precision medicine, microbiome analysis, single-cell sequencing and wildlife monitoring;

RL6) the study of species sampling problems under the constraint that only small summaries of the data are available, through (sufficient) statistics or sketches obtained through hash functions, with connections to compressed sensing and, more generally, to sparse recovery with sparse matrices.
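As a rough illustration of the classical species sampling problem behind RL1 and RL4, the sketch below estimates the probability that the next observation belongs to a previously unseen species in two standard ways: the Good-Turing estimator (number of singletons divided by the sample size) and the predictive probability under a Pitman-Yor prior, a commonly used two-parameter prior with power-law behaviour. The function names, the toy sample and the parameter values `alpha` and `theta` are assumptions chosen for illustration; this is not the project's methodology.

```python
from collections import Counter


def good_turing_missing_mass(sample):
    """Good-Turing estimate of the probability that the next draw is a new species.

    It is the fraction of the sample made up of species observed exactly once.
    """
    n = len(sample)
    if n == 0:
        return 1.0  # nothing observed yet: every species is new
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def pitman_yor_new_species_prob(sample, alpha, theta):
    """Predictive probability of a new species under a Pitman-Yor(alpha, theta) prior.

    With n observations and k distinct species, this is (theta + alpha*k) / (theta + n).
    """
    if not 0 <= alpha < 1 or theta <= -alpha:
        raise ValueError("require 0 <= alpha < 1 and theta > -alpha")
    n = len(sample)
    if n == 0:
        return 1.0
    k = len(set(sample))
    return (theta + alpha * k) / (theta + n)


# Toy sample (illustrative only): 8 individuals, 5 species, 3 singletons.
sample = ["a", "a", "b", "c", "c", "c", "d", "e"]
gt = good_turing_missing_mass(sample)
py = pitman_yor_new_species_prob(sample, alpha=0.5, theta=1.0)
print(f"Good-Turing: {gt:.4f}  Pitman-Yor: {py:.4f}")
```

In an empirical Bayes treatment, `alpha` and `theta` would be estimated from the data rather than fixed by hand as they are here.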

With regard to RT2, we started the following research lines (RL):

RL7) the development of a nonparametric Bayes methodology and a nonparametric empirical Bayes methodology for disclosure risk assessment, which is at the basis of some modern privacy-preserving mechanisms;

RL8) the study, and the development, of species sampling problems within the framework of global differential privacy and the framework of local differential privacy, with respect to suitable perturbation mechanisms, e.g. noise addition, generalized randomized response and bit flipping;

RL9) the development of a comprehensive theory for goodness-of-fit tests, with emphasis on the study of the power of the test, under both the framework of global differential privacy and that of local differential privacy.
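To make one of the perturbation mechanisms named in RL8 concrete, the following sketch implements the standard generalized (k-ary) randomized response under local differential privacy, together with the usual debiased frequency estimator. Each individual reports their true category with probability p = e^ε / (e^ε + k - 1) and any other given category with probability q = 1 / (e^ε + k - 1). The function names, the three-species domain and the value of `epsilon` are assumptions made for illustration, not taken from the project.

```python
import math
import random
from collections import Counter


def grr_probabilities(epsilon, k):
    """Return (p, q): probability of reporting the truth vs. any other given value."""
    e = math.exp(epsilon)
    return e / (e + k - 1), 1.0 / (e + k - 1)


def grr_perturb(value, domain, epsilon, rng):
    """Privatize one value with generalized randomized response."""
    p, _ = grr_probabilities(epsilon, len(domain))
    if rng.random() < p:
        return value
    # Otherwise report one of the remaining k-1 values uniformly at random.
    return rng.choice([v for v in domain if v != value])


def grr_estimate_frequencies(reports, domain, epsilon):
    """Debiased estimates of the true category frequencies from noisy reports."""
    n = len(reports)
    p, q = grr_probabilities(epsilon, len(domain))
    counts = Counter(reports)
    return {v: (counts[v] / n - q) / (p - q) for v in domain}


# Hypothetical data: three species with true frequencies 0.7, 0.2, 0.1.
rng = random.Random(0)
domain = ["a", "b", "c"]
truth = ["a"] * 14000 + ["b"] * 4000 + ["c"] * 2000
epsilon = 2.0
reports = [grr_perturb(v, domain, epsilon, rng) for v in truth]
estimates = grr_estimate_frequencies(reports, domain, epsilon)
print({v: round(f, 3) for v, f in estimates.items()})
```

Smaller values of `epsilon` give stronger privacy but noisier estimates, which is the tradeoff between privacy protection and accurate inference that RT2 addresses.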

In addition to RT1 and RT2, we started a new research theme (RT3), under which we aim to investigate the use of deep neural networks, nowadays very popular in the scientific community, in the context of species sampling problems. Preliminary results have been obtained on the large-width and large-depth asymptotic behaviour of feedforward and convolutional deep Bayesian neural networks, also in terms of contraction rates, under both Gaussian random weights and Stable random weights for the neural network. Other preliminary results concern the training of feedforward neural networks through gradient descent, leading to an interesting generalization of the popular notion of the neural tangent kernel and establishing a link with kernel regression.

The work in the period March 2019 - February 2022, within research themes RT1, RT2 and RT3, produced numerous results that represent remarkable progress beyond the state of the art, and also paved the way for new research directions in the context of species sampling problems and generalizations thereof. As a result, the project has continued to evolve with respect to its starting point. In the coming months, we will keep working on the research lines started in the period March 2019 - February 2022. Among these research lines, RL2, RL3 and RL6 led to novel and promising ideas that deserve particular attention and further investigation.

With regard to RL2, we aim to revisit the classical approach to the frequentist validation of Bayesian procedures, in both the parametric and the nonparametric setting, within the mathematical framework of optimal transport; we expect this research to produce interesting results in statistics, as well as interesting connections with the fields of probability and analysis. With regard to RL3, our results suggest the need to introduce novel classes of nonparametric prior distributions for feature/trait allocation models, which may produce more robust inferences than the competing priors currently known in the literature. With regard to RL6, the work sets forth a novel methodology that has revealed interesting connections with the following areas: i) sketching algorithms for streaming data; ii) compressed sensing; iii) sparse recovery via sparse matrices; iv) multi-armed bandits for "better machine learning". In particular, we expect to introduce a novel methodology to deal with general species sampling problems under sketched data; promising preliminary results have been obtained in the context of the estimation of coverage probabilities.
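For readers unfamiliar with sketched data, the following minimal example shows one widely used hash-based summary, the count-min sketch, in which a stream of species labels is compressed into a small table of counters and only that table is retained. The choice of the count-min sketch, the class name and the `width` and `depth` values are assumptions for illustration; the report does not specify which sketching algorithm is used in RL6.

```python
import hashlib


class CountMinSketch:
    """A small table of counters that summarizes a stream through hash functions."""

    def __init__(self, width, depth):
        self.width = width  # number of buckets per row
        self.depth = depth  # number of rows, one hash function per row
        self.table = [[0] * width for _ in range(depth)]

    def _bucket(self, item, row):
        # A different salt per row yields a different hash function per row.
        digest = hashlib.blake2b(
            str(item).encode("utf-8"),
            digest_size=8,
            salt=row.to_bytes(8, "little"),
        ).digest()
        return int.from_bytes(digest, "little") % self.width

    def add(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._bucket(item, row)] += count

    def estimate(self, item):
        """Upper bound on the frequency of item: collisions can only inflate counts."""
        return min(self.table[row][self._bucket(item, row)] for row in range(self.depth))


# Hypothetical stream: two common species and twenty species seen once each.
stream = ["a"] * 50 + ["b"] * 30 + [f"rare{i}" for i in range(20)]
sketch = CountMinSketch(width=64, depth=4)
for label in stream:
    sketch.add(label)
print(sketch.estimate("a"), sketch.estimate("b"), sketch.estimate("rare0"))
```

The estimates never fall below the true frequencies but may exceed them because of hash collisions, which is what makes inference on quantities such as coverage probabilities from sketches alone a nontrivial statistical problem.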