In a wide variety of computational settings, where the input data is most naturally viewed as coming from a distribution, it is often crucial to determine whether the underlying distribution satisfies various properties. Examples of such properties include whether two distributions are close or far in statistical distance, whether a joint distribution is independent, and whether a distribution has high entropy. For most such properties, standard statistical techniques that approximate the distribution yield algorithms whose sample complexity is nearly linear in the domain size. Until very recently, distributions over large domains, for which linear sample complexity can be daunting, had received surprisingly little attention. New interest in these questions, however, comes from many directions, including data mining, research in the natural sciences, and networking algorithms, and recent results have shown that for large domains one can do significantly better than the standard techniques.

We propose a research program that will lead to an understanding of the sample, time, and space complexity required to identify various natural properties of a probability distribution. We will focus on determining which properties can be tested with a number of samples that is sublinear in the domain size, and on identifying the aspects of algorithm design that are specific to this regime. The questions to be considered include the complexity of testing previously unstudied properties, the complexity of approximating the distance to having a property, improved algorithms for important subclasses of distributions, new models of distribution testing, and the relationship between computational complexity and sample complexity.
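To illustrate what a sublinear-sample tester looks like, the sketch below implements a collision-based uniformity test, a standard example from the distribution-testing literature rather than a method specific to this proposal. The function names and the `slack` threshold are illustrative choices, not calibrated constants; the key point is that the number of samples grows roughly like the square root of the domain size, far below the near-linear cost of estimating the distribution itself.

```python
import random
from collections import Counter

def collision_statistic(samples):
    """Fraction of sample pairs that collide (take the same value).

    For a distribution p over a domain of size n, the expected pairwise
    collision probability is sum_i p_i^2, which is minimized (= 1/n)
    exactly by the uniform distribution.
    """
    counts = Counter(samples)
    m = len(samples)
    pairs = m * (m - 1) // 2
    colliding = sum(c * (c - 1) // 2 for c in counts.values())
    return colliding / pairs

def looks_uniform(samples, n, slack=1.5):
    """Heuristic uniformity check: accept when the collision rate is
    close to the uniform baseline 1/n. `slack` is an illustrative
    parameter, not a rigorously derived constant."""
    return collision_statistic(samples) <= slack / n

# Roughly sqrt(n) samples suffice for this style of test -- far fewer
# than the ~n samples needed to approximate the distribution directly.
random.seed(0)
n = 10_000
m = 2_000  # about 20 * sqrt(n) samples
uniform_samples = [random.randrange(n) for _ in range(m)]
skewed_samples = [random.randrange(n // 100) for _ in range(m)]
print(looks_uniform(uniform_samples, n))  # expect True
print(looks_uniform(skewed_samples, n))   # expect False
```

A distribution concentrated on a small fraction of the domain collides far more often than 1/n, so the test distinguishes the two cases from only a few thousand samples over a domain of size 10,000.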