Statistical Foundations for Network Modelling and Inference

Project information

STATNET

Grant agreement ID: 334622

Closed project

Start date 1 June 2013

End date 31 May 2017

Funded under

Specific programme "People" implementing the Seventh Framework Programme of the European Community for research, technological development and demonstration activities (2007 to 2013)

Total cost

€ 100 000,00

EU contribution

€ 100 000,00

Coordinated by

UNIVERSITY COLLEGE LONDON
United Kingdom

Final Report Summary - STATNET (Statistical Foundations for Network Modelling and Inference)

StatNet is a four-year interdisciplinary research effort focused on developing statistical foundations for network modelling and inference. Modern technology means that network datasets are pervasive and fundamental in science and society. Whenever we observe entities and relations between them, either directly or induced from other data as a means of summarising sparse dependency structure, we draw inferences from network data. These datasets are growing so rapidly in complexity and dimensionality that our analysis methods must undergo a paradigm shift to keep pace. This is the motivation for StatNet.

The vision of this research is to revolutionise our mathematical understanding of network data, and to transform this understanding into tools to model and draw inferences from network data in the real world. To achieve this vision, StatNet set out to answer the following question: What fundamental statistical principles will allow us to optimally model and make inferences from network data? This question is important because graphs and networks as data objects are pervasive and fundamental in modern science and society, and are used more and more by the day to inform decision making in a variety of contexts. Answering it is challenging because networks model complex and high-dimensional yet sparse dependency structures, and we currently lack a foundational understanding of them as statistical objects.

Since the beginning of the project, StatNet's research has produced the following main results:

1. Publication of clustering methods for networks: D. S. Choi and P. J. Wolfe, "Co-clustering separately exchangeable network data," Annals of Statistics, vol. 42, pp. 29-63, 2014 (doi:10.1214/13-AOS1173). This article establishes the performance of stochastic blockmodels in addressing the co-clustering problem of partitioning a binary array into subsets, assuming only that the data are generated by a nonparametric process satisfying the condition of separate exchangeability. We provide oracle inequalities with rate of convergence O_P(n^(-1/4)) corresponding to profile likelihood maximization and mean-square error minimization, and show that the blockmodel can be interpreted in this setting as an optimal piecewise-constant approximation to the generative nonparametric model. We also show for large sample sizes that the detection of co-clusters in such data indicates with high probability the existence of co-clusters of equal size and asymptotically equivalent connectivity in the underlying generative process.

2. Development of nonparametric network models: P. J. Wolfe and S. C. Olhede, "Nonparametric graphon estimation" (arXiv:1309.5936). We propose a nonparametric framework for the analysis of networks, based on a natural limit object termed a graphon. We prove consistency of graphon estimation under general conditions, giving rates which include the important practical setting of sparse networks. Our results cover dense and sparse stochastic blockmodels with a growing number of classes, under model misspecification. We use profile likelihood methods, and connect our results to approximation theory, nonparametric function estimation, and the theory of graph limits.

3. Development and publication of network histogram models: S. C. Olhede and P. J. Wolfe, "Network histograms and universality of blockmodel approximation," Proceedings of the National Academy of Sciences of the USA, vol. 111, pp. 14722-14727, 2014 (doi:10.1073/pnas.1400374111). In this article we introduce the network histogram: a statistical summary of network interactions. A network histogram is obtained by fitting a stochastic blockmodel to a single observation of a network dataset. Blocks of edges play the role of histogram bins, and community sizes that of histogram bandwidths or bin sizes. Just as standard histograms allow for varying bandwidths, different blockmodel estimates can all be considered valid representations of an underlying probability model, subject to bandwidth constraints. We show that under these constraints, the mean integrated square error of the network histogram tends to zero as the network grows large, and we provide methods for optimal bandwidth selection--thus making the blockmodel a universal representation. With this insight, we discuss the interpretation of network communities in light of the fact that many different community assignments can all give an equally valid representation of the network. To demonstrate the fidelity-versus-interpretability tradeoff inherent in considering different numbers and sizes of communities, we show an example of detecting and describing new network community microstructure in political weblog data.

4. Development of clustering significance methods for networks: B. Franke and P. J. Wolfe, "Network modularity in the presence of covariates" (arXiv:1603.01214). This manuscript characterizes the large-sample properties of network modularity in the presence of covariates, under a natural and flexible nonparametric null model. This provides for the first time an objective measure of whether or not a particular value of modularity is meaningful. In particular, our results quantify the strength of the relation between observed community structure and the interactions in a network. Our technical contribution is to provide limit theorems for modularity when a community assignment is given by nodal features or covariates. These theorems hold for a broad class of network models over a range of sparsity regimes, as well as weighted, multi-edge, and power-law networks. This allows us to assign p-values to observed community structure, which we validate using several benchmark examples in the literature. We conclude by applying this methodology to investigate a multi-edge network of corporate email interactions.

5. Development of fast counting procedures for network features: P.-A. G. Maugis, S. C. Olhede, and P. J. Wolfe, "Fast counting of medium-sized rooted subgraphs" (arXiv:1701.00177). We prove that counting copies of any graph F in another graph G can be achieved using basic matrix operations on the adjacency matrix of G. Moreover, the resulting algorithm is competitive for medium-sized F: our algorithm recovers the best known complexity for rooted 6-clique counting and improves on the best known for 9-cycle counting. Underpinning our proofs is the new result that, for a general class of graph operators, matrix operations are homomorphisms for operations on rooted graphs.

6. Development of methods for network sample comparison: P.-A. G. Maugis, S. C. Olhede, C. E. Priebe, and P. J. Wolfe, "Statistical inference for network samples using subgraph counts" (arXiv:1701.00505). We consider that a network is an observation, and a collection of observed networks forms a sample. In this setting, we provide methods to test whether all observations in a network sample are drawn from a specified model. We achieve this by deriving, under the null of the graphon model, the joint asymptotic properties of average subgraph counts as the number of observed networks increases but the number of nodes in each network remains finite. In doing so, we do not require that each observed network contains the same number of nodes, or is drawn from the same distribution. Our results yield joint confidence regions for subgraph counts, and therefore methods for testing whether the observations in a network sample are drawn from: a specified distribution, a specified model, or from the same model as another network sample. We present simulation experiments and an illustrative example on a sample of brain networks where we find that highly creative individuals' brains present significantly more short cycles.

7. P.-A. G. Maugis, S. C. Olhede, and P. J. Wolfe, "Topology reveals universal features for network comparison" (arXiv:1705.05677). The topology of any complex system is key to understanding its structure and function. Fundamentally, algebraic topology guarantees that any system represented by a network can be understood through its closed paths. The length of each path provides a notion of scale, which is vitally important in characterizing dominant modes of system behavior. Here, by combining topology with scale, we prove the existence of universal features which reveal the dominant scales of any network. We use these features to compare several canonical network types in the context of a social media discussion which evolves through the sharing of rumors, leaks and other news. Our analysis enables for the first time a universal understanding of the balance between loops and tree-like structure across network scales, and an assessment of how this balance interacts with the spreading of information online. Crucially, our results allow networks to be quantified and compared in a purely model-free way that is theoretically sound, fully automated, and inherently scalable.
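Result 3 above describes blocks of edges as playing the role of histogram bins. The Python sketch below illustrates only that idea: it assumes a community assignment is already given (the article estimates one by fitting a stochastic blockmodel, which is not reproduced here) and computes the fraction of possible edges present within and between groups. The function name `block_densities`, the list-of-lists adjacency format, and the toy graph are illustrative assumptions, not taken from the article.

```python
from itertools import product


def block_densities(adj, labels):
    """Edge density for every pair of groups in a simple undirected graph.

    adj    -- symmetric 0/1 adjacency matrix as a list of lists (assumed format)
    labels -- labels[i] is the group of node i (assumed to be given)
    """
    groups = sorted(set(labels))
    members = {g: [i for i, lab in enumerate(labels) if lab == g] for g in groups}
    densities = {}
    for a, b in product(groups, repeat=2):
        if a == b:
            nodes = members[a]
            # Unordered node pairs inside one group.
            pairs = len(nodes) * (len(nodes) - 1) // 2
            edges = sum(adj[i][j] for k, i in enumerate(nodes) for j in nodes[k + 1:])
        else:
            # Every node of group a paired with every node of group b.
            pairs = len(members[a]) * len(members[b])
            edges = sum(adj[i][j] for i in members[a] for j in members[b])
        densities[(a, b)] = edges / pairs if pairs else float("nan")
    return densities


# Toy path graph 0-1-2-3 with two groups of two nodes each.
toy_adj = [
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
]
toy_densities = block_densities(toy_adj, [0, 0, 1, 1])
```

Each entry of `toy_densities` corresponds to one "bin"; larger groups average over more node pairs, which is the sense in which community size acts like a bandwidth.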
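Result 4 above concerns modularity when the community assignment comes from nodal covariates. As a rough illustration, the Python sketch below computes the standard Newman-Girvan modularity for a given assignment; using this particular definition is an assumption here, and the null model, limit theorems, and p-values of the manuscript are not reproduced. The name `modularity` and the toy graph are hypothetical.

```python
def modularity(adj, groups):
    """Newman-Girvan modularity of a grouping on an undirected graph.

    adj    -- symmetric adjacency matrix as a list of lists (assumed format)
    groups -- groups[i] is the covariate-derived group of node i
    """
    n = len(adj)
    degree = [sum(row) for row in adj]
    two_m = sum(degree)  # twice the number of edges
    if two_m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    total = 0.0
    for i in range(n):
        for j in range(n):
            if groups[i] == groups[j]:
                # Observed adjacency minus its expectation under degree-preserving rewiring.
                total += adj[i][j] - degree[i] * degree[j] / two_m
    return total / two_m


# Two triangles (nodes 0-2 and 3-5) joined by the single edge 2-3.
edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
toy_adj = [[0] * 6 for _ in range(6)]
for u, v in edges:
    toy_adj[u][v] = toy_adj[v][u] = 1

q = modularity(toy_adj, ["a", "a", "a", "b", "b", "b"])  # equals 5/14
```

A value such as `q` is a descriptive statistic only; deciding whether it is meaningful is exactly the question the manuscript's limit theorems address.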
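Result 5 above states that subgraph counting can be carried out with basic matrix operations on the adjacency matrix. The simplest well-known instance is triangle counting: the i-th diagonal entry of the cubed adjacency matrix counts closed walks of length three from node i, which is twice the number of triangles rooted at i. The Python sketch below shows only this special case and is not the authors' algorithm for general F; the function names are hypothetical.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of lists."""
    columns = list(zip(*y))
    return [[sum(a * b for a, b in zip(row, col)) for col in columns] for row in x]


def rooted_triangle_counts(adj):
    """Number of triangles through each node of a simple undirected graph."""
    cube = matmul(matmul(adj, adj), adj)
    # Each triangle through node i is traversed in two directions.
    return [cube[i][i] // 2 for i in range(len(adj))]


def triangle_count(adj):
    """Total number of triangles; each one is rooted at three nodes."""
    return sum(rooted_triangle_counts(adj)) // 3


# Complete graph on four nodes.
k4 = [[0 if i == j else 1 for j in range(4)] for i in range(4)]
per_node = rooted_triangle_counts(k4)  # [3, 3, 3, 3]
total = triangle_count(k4)             # 4
```

This naive version costs two dense matrix products; the complexity results quoted above refer to the manuscript's own constructions, not to this sketch.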

The main results achieved, summarized above, are available open-access on arXiv as referenced at the StatNet home page, www.ucl.ac.uk/statistics/people/patrickwolfe. The final results of StatNet have advanced our scientific understanding of networks as statistical data objects. By developing novel statistical foundations for network modelling and inference, StatNet has resulted in a core set of statistical fundamentals that provide both the strong theoretical underpinnings and the practical tools required to revolutionise this field.

The methods developed in StatNet are applicable to a range of important practical problems, which allowed them to be assessed, refined and improved while under development. This provides a direct pathway to socio-economic impact and establishes a tight coupling between the mathematical advancements achieved and the important practical problems that these advancements benefit. It also opens up new mathematical connections with other disciplines where networks play a key role, such as the life sciences, and leads directly to new techniques that impact research users across a range of important practical applications that directly affect the health, security and economic competitiveness of the European Community populace.

Undertaking this research project at the European Community level benefits a number of constituencies, with academic impact across the Community ranging from the mathematics and statistics communities through to a broad range of scientific disciplines, including the social, economic, political, physical, and life sciences, and with economic and societal impact spanning the EU public and private sectors.
