Bayesian nonparametric methods for networks and recommender systems

Final Report Summary - BNPNET (Bayesian nonparametric methods for networks and recommender systems)

The last few years have seen tremendous interest in the study and understanding of complex networks. A network consists of a set of items, called vertices, with connections between them, called edges. Examples include the World Wide Web, social networks of friendship or other connections between individuals, organisational networks, networks of citations between papers, networks of shared interests in books or music, metabolic networks, food webs, etc.

Statistical modeling of structured networks aims at building probabilistic network models characterized by a set of unknown parameters. These parameters may, for example, represent the popularity of a given node or the probability that it belongs to some community. Once inferred from the observed network, these parameters give insight into the structure of the network. Such modeling is of major importance for a broad range of applications, from social sciences to biology and information engineering. In social sciences, people are often interested in identifying communities from an observed network, such as Facebook or LinkedIn. In communication networks such as cell phone networks, network analysis may be used to detect latent terrorist cells. In computational biology, it can be used to find hidden groups in the protein-protein interaction network, which helps to gain a better understanding of biological processes. Models can also be used to combine networks from heterogeneous data sources to improve the accuracy of predicted genetic interactions.

Network models are often used to predict missing edges or nodes in a network. Applications include predicting missing edges in a business or a terrorist network. They are also particularly useful for building recommender systems, which aim to provide automated, targeted recommendations to individuals based on items they like. Recommender systems, which have gained a lot of attention over the past few years thanks to the famous Netflix prize, are now ubiquitous and used by companies like Amazon, Apple or YouTube. Recommender systems are of particular economic interest in business. According to the developer of the Amazon recommendation system, which pioneered automatic recommendations, 20% of the sales of Amazon in 2002 came from personalized recommendations. In this case, given a bipartite network of customers and items in which an edge represents the purchase of a given item by a given customer, we aim to provide relevant recommendations to potential buyers of a given product. One may also be interested in obtaining a market segmentation of the customers and/or products, and in identifying trends in the evolution of the popularity of products.

Realistic data models must be able to capture the salient features of real-world graphs. Networks are often characterized in terms of their degree distribution. The degree of a given vertex is the number of connections of this vertex. Degree distributions of real-world networks such as the Internet, telephone networks or co-authorship networks are often strongly non-Poissonian with a power-law behavior. In particular for customer/item networks, the distribution of the number of purchases of items is often heavy-tailed with a power-law behavior: most purchases only concern a small number of popular items. This is for example the case for book or music sales. Providing models that can adequately capture this power-law behavior is economically important, as useful information can be extracted from the tails of the distribution. One may be interested in identifying products with rising popularity, or niche markets with high potential. Moreover, networks typically have a very large number of vertices, and the number of edges scales linearly with the number of vertices. The degree of a given node is therefore much lower than the number of vertices, which can be considered infinite by comparison. As an example, the Book-Crossing dataset, collected from the Book-Crossing community, contains about 1 million ratings by 278 000 users of 271 000 books, that is, fewer than 4 ratings per user on average. In e-commerce, the number of articles purchased by a single user is much smaller than the number of items available. We therefore need models that are scalable, and whose limit when the number of vertices becomes infinite still has remarkable statistical properties.
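
To make the notions of degree and degree distribution concrete, here is a minimal Python sketch for a bipartite customer/item network stored as an edge list. The toy edge list and the function names are hypothetical and serve only as illustration. The Book-Crossing figures are the rounded values quoted above, so the resulting average is approximate.

```python
from collections import Counter


def degree_counts(edges):
    """Return (user_degree, item_degree) for a bipartite edge list.

    Each edge is a (user, item) pair; the degree of a vertex is the
    number of edges it appears in.
    """
    user_degree = Counter(user for user, _ in edges)
    item_degree = Counter(item for _, item in edges)
    return user_degree, item_degree


def degree_distribution(degrees):
    """Map each degree value to the number of vertices with that degree."""
    return Counter(degrees.values())


# Hypothetical toy network: item "a" is popular, items "b" and "c" are niche.
edges = [("u1", "a"), ("u1", "b"), ("u2", "a"), ("u3", "a"), ("u3", "c")]
user_degree, item_degree = degree_counts(edges)
item_distribution = degree_distribution(item_degree)

# Rounded Book-Crossing figures quoted in the text (approximate).
n_ratings = 1_000_000
n_users = 278_000
mean_ratings_per_user = n_ratings / n_users

print(dict(item_distribution))
print(round(mean_ratings_per_user, 2))
```

On real purchase data, a heavy-tailed item distribution would show up as a few items with very large degree and many items with degree one or two.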

In this project, we have developed novel classes of statistical models for networks that
(a) can exhibit the power-law behavior of real-world networks,
(b) have interpretable parameters that help to gain a better understanding of the structure of the network,
(c) have well-understood large-scale properties, and
(d) admit scalable algorithms for learning the parameters of the models.

The main result is that the associated class of models can produce so-called "sparse" graphs, in which the number of connections in the network is much smaller than the maximum number of possible connections, as is expected of many real-world graphs. Models with this property have also been proposed to deal with graphs with latent structure and with dynamic graphs, but without associated algorithms for learning their parameters from data. We have shown that the models are able to learn and evaluate the level of sparsity of a range of real-world graphs, including social networks, biological networks and internet networks, and to make more accurate predictions for graphs with these characteristics.
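
The notion of sparsity used above compares the number of edges with the maximum possible number of edges. The following Python sketch illustrates this for a simple undirected graph, where n vertices allow at most n(n-1)/2 edges. The function name and the growth regime of 4 edges per vertex are assumptions chosen for illustration, not quantities taken from the project.

```python
def edge_density(n_vertices, n_edges):
    """Fraction of possible edges present in a simple undirected graph."""
    if n_vertices < 2:
        raise ValueError("need at least two vertices")
    max_edges = n_vertices * (n_vertices - 1) // 2
    return n_edges / max_edges


# Hypothetical regime: the number of edges grows linearly with the
# number of vertices (here 4 edges per vertex).
edges_per_vertex = 4
densities = {
    n: edge_density(n, edges_per_vertex * n)
    for n in (100, 10_000, 1_000_000)
}

for n, density in densities.items():
    print(n, density)
```

With a linear number of edges the density equals 8/(n-1) in this example, so it tends to zero as the graph grows. This is the behavior described in the text as "sparse".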