Community Research and Development Information Service - CORDIS

Algorithms for large data sets

The Web-graph is the graph whose nodes are the (static) HTML pages and whose (directed) edges are the hyperlinks between pages. It has received extensive attention because many applications, primarily in Web mining, benefit from the analysis of the link structure of the Web. One example is the family of page-ranking algorithms such as PageRank and HITS. Link analysis also underlies the study of the sociology of content creation and the detection of structures hidden in the Web (such as bipartite cores of cyber communities and web-rings).
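As an illustration of link-based ranking, the following is a minimal sketch of PageRank computed by power iteration over an edge list. It is not taken from the library described here: the function name `pagerank`, the damping factor 0.85, the fixed iteration count, and the toy graph are all assumptions chosen for the example.

```python
def pagerank(edges, num_nodes, damping=0.85, iterations=50):
    """Power iteration over a directed edge list of (src, dst) pairs.

    Hypothetical helper for illustration; only O(num_nodes) values are
    kept as state, and the edge list is scanned once per iteration.
    """
    out_degree = [0] * num_nodes
    for src, _dst in edges:
        out_degree[src] += 1

    rank = [1.0 / num_nodes] * num_nodes
    for _ in range(iterations):
        # Rank held by dangling pages (no out-links) is spread uniformly.
        dangling = sum(rank[v] for v in range(num_nodes) if out_degree[v] == 0)
        base = (1.0 - damping) / num_nodes + damping * dangling / num_nodes
        new_rank = [base] * num_nodes
        # Each page passes an equal share of its rank along every out-link.
        for src, dst in edges:
            new_rank[dst] += damping * rank[src] / out_degree[src]
        rank = new_rank
    return rank


# Toy graph (assumed data): page 2 is linked from pages 0, 1 and 3.
edges = [(0, 1), (0, 2), (1, 2), (2, 0), (3, 2)]
ranks = pagerank(edges, num_nodes=4)
```

In this toy graph page 2 collects the most links and ends up with the highest rank, while page 3, which nobody links to, keeps only the baseline share.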

The experimental study of the statistical and topological properties of the Web-graph is at the core of this discipline and is the basis for validating stochastic graph models of the Web. Studying and analysing the Web-graph requires dealing with massive graphs. In this deliverable we present a collection of algorithms, with their implementations, that generate and measure massive graphs in secondary memory. These are external and semi-external memory algorithms that we developed in order to generate and analyse Web-graphs. We define a standard file format that we use to represent both the graphs and the results of the measurement processes. The library contains routines for simulating stochastic graph models that reproduce properties of the Web, for measuring PageRank and the degree distribution, for finding correlations between different measures, for finding connected components and cliques of small size (which are considered seeds of cyber communities), and for detecting the overall structure of the Web-graph.
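The semi-external setting can be sketched as follows: per-node data is kept in main memory while the edges are streamed sequentially from disk. This sketch computes the in-degree distribution and the number of weakly connected components in a single pass, using a union-find structure. It is an illustration only, not the library's code: the input format (one `src dst` pair of integer node ids per line) is an assumption and is not the standard file format mentioned above, and the names `read_edges`, `find`, and `scan_graph` are hypothetical.

```python
import io
from collections import Counter


def read_edges(stream):
    """Yield (src, dst) pairs from a text stream, one 'src dst' pair per line."""
    for line in stream:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        src, dst = line.split()
        yield int(src), int(dst)


def find(parent, v):
    """Return the representative of v, halving the path along the way."""
    while parent[v] != v:
        parent[v] = parent[parent[v]]
        v = parent[v]
    return v


def scan_graph(stream, num_nodes):
    """Single pass over the edges with O(num_nodes) memory.

    Returns the in-degree histogram and the number of weakly
    connected components (edge direction is ignored for the latter).
    """
    in_degree = [0] * num_nodes
    parent = list(range(num_nodes))
    for src, dst in read_edges(stream):
        in_degree[dst] += 1
        root_a, root_b = find(parent, src), find(parent, dst)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components
    histogram = Counter(in_degree)
    components = sum(1 for v in range(num_nodes) if find(parent, v) == v)
    return histogram, components


# An in-memory stream stands in for a file on disk (assumed sample data).
sample = io.StringIO("0 1\n0 2\n1 2\n2 0\n3 2\n4 5\n")
histogram, components = scan_graph(sample, num_nodes=7)
```

In practice the `io.StringIO` object would be replaced by an open file handle, so that the edge list never has to fit in main memory.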

All routines can compute these measures on very large graphs, even on a medium-sized PC. The library has been used by several research groups in Europe that study large complex networks, namely at Helsinki University, the Academy of Sciences of Budapest, and the Università di Milano. To the best of our knowledge, it is currently the only publicly available library containing a complete suite of routines for analysing large Web-graphs.

Reported by

Dipartimento di Informatica e Sistemistica
via Salaria 113
00198 Roma