Processing Large XML Data Sets: Algorithms and Limitations

Project Information

PROC-LXML

Grant agreement ID: 16810

Project closed

Start date 31 August 2005

End date 30 August 2007

Funded under

Human resources and Mobility in the specific programme for research, technological development and demonstration "Structuring the European Research Area" under the Sixth Framework Programme 2002-2006

Total cost

No data

EU contribution

€ 80 000,00

Coordinated by

TECHNION - ISRAEL INSTITUTE OF TECHNOLOGY
Israel

Final Activity Report Summary - PROC-LXML (Processing Large XML Data Sets: Algorithms and Limitations)

In the PROC-LXML project we studied methods for searching large collections of web and XML documents. The main contributions of the project were the following:

Search engine measurement. We developed a novel technique for estimating statistical parameters of a search engine by sending random queries to the search engine and analysing their results. This technique enables measurement and benchmarking of search engines without having to rely on their cooperation. We used the new technique to estimate the size, the freshness, and the quality of major search engines. One of the papers we published about this technique won the Best Paper Award at the International World-Wide Web Conference.
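
The following is a minimal sketch of why results sampled through random queries need to be reweighted before they can be used to estimate an engine-wide statistic. It is an illustration under stated assumptions, not the project's actual estimator: the toy index, the query pool, and the function names (search, sample_weighted, estimate_fresh_fraction) are all hypothetical. A document that matches many queries is sampled more often, so each sample is weighted by the inverse of its sampling probability (self-normalised importance sampling).

```python
import random

# Hypothetical toy index: doc id -> (set of terms, is_fresh flag).
INDEX = {
    0: ({"xml", "query", "stream"}, True),
    1: ({"xml"}, False),
    2: ({"query", "rank"}, False),
    3: ({"xml", "query", "rank", "stream"}, True),
    4: ({"rank"}, False),
    5: ({"stream", "xml"}, True),
}
# Hypothetical pool of single-term queries; every document matches at least one.
QUERY_POOL = ["xml", "query", "rank", "stream"]


def search(term):
    """Stand-in for the engine's public interface: ids of matching documents."""
    return sorted(d for d, (terms, _) in INDEX.items() if term in terms)


def sample_weighted(rng):
    """Draw one document via a random query and return (is_fresh, weight)."""
    term = rng.choice(QUERY_POOL)
    doc = rng.choice(search(term))
    terms, fresh = INDEX[doc]
    # The chance of reaching this document is proportional to the sum,
    # over the pool queries it matches, of 1 / (result count of that query).
    p = sum(1.0 / len(search(t)) for t in QUERY_POOL if t in terms)
    return fresh, 1.0 / p


def estimate_fresh_fraction(n, seed=0):
    """Return (reweighted estimate, naive estimate) of the fresh fraction."""
    rng = random.Random(seed)
    num = den = naive = 0.0
    for _ in range(n):
        fresh, w = sample_weighted(rng)
        num += w * fresh
        den += w
        naive += fresh
    return num / den, naive / n


corrected, naive = estimate_fresh_fraction(20000)
```

In this toy index exactly half of the documents are fresh. The naive average overestimates that fraction because the fresh documents happen to match more queries, while the reweighted estimate stays close to one half.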

Complexity of searching XML documents. We proved theoretical lower bounds on the amount of memory required to support searching over XML documents, a task that is significantly harder than searching over regular text documents.

Ranking in social networks. We presented two ranking algorithms for social networks: one that ranks individuals by their degree of influence in the network, and another that ranks groups by how cohesive and tightly knit they are. A paper about the latter algorithm won an honourable mention for the Best Application Paper Award at the International Conference on Data Mining.

Detection of near-duplicate documents. We developed a highly efficient algorithm that detects near-duplicate web documents by examining only their URLs and without having to inspect their content.
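
As a rough illustration of the URL-only idea, the sketch below groups URLs that become identical after a few substring substitutions are applied. The rules, the example URLs, and the function names (canonicalize, group_candidates) are hypothetical, and the summary above does not say how the project's algorithm derives or validates such rules. The sketch only shows how candidate duplicates can be found without fetching page content once rules are given.

```python
from collections import defaultdict

# Hypothetical substitution rules (old substring -> new substring).
# In practice such rules would have to be learned and validated per site.
RULES = [
    ("http://www.", "http://"),
    ("/index.html", "/"),
]


def canonicalize(url):
    """Apply every substitution rule in order to obtain a canonical form."""
    for old, new in RULES:
        url = url.replace(old, new)
    return url


def group_candidates(urls):
    """Group URLs sharing a canonical form; keep groups with 2+ members."""
    groups = defaultdict(list)
    for u in urls:
        groups[canonicalize(u)].append(u)
    return [g for g in groups.values() if len(g) > 1]


urls = [
    "http://www.example.com/index.html",
    "http://example.com/",
    "http://example.com/about",
]
duplicates = group_candidates(urls)
```

Here the first two URLs fall into the same group because both reduce to the same canonical form, while the third stays on its own.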
