Processing Large XML Data Sets: Algorithms and Limitations

Final Activity Report Summary - PROC-LXML (Processing Large XML Data Sets: Algorithms and Limitations)

In the PROC-LXML project we studied methods for searching large collections of web and XML documents. The main contributions of the project were the following:

Search engine measurement. We developed a novel technique for estimating statistical parameters of a search engine by sending random queries to the search engine and analysing their results. This technique enables measurement and benchmarking of search engines without having to rely on their cooperation. We used the new technique to estimate the size, the freshness, and the quality of major search engines. One of the papers we published about this technique won the Best Paper Award at the International World-Wide Web Conference.
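The measurement idea can be illustrated with a classical capture-recapture estimate: if two independent random samples of an engine's index overlap in k documents, the index size is roughly |A|·|B|/k. The sketch below is only a minimal illustration of this principle; the simulated index, the sample sizes, and the use of the Lincoln-Petersen estimator are assumptions for the example, not the project's actual estimation technique.

```python
import random

def capture_recapture_size(sample_a, sample_b):
    """Lincoln-Petersen capture-recapture estimator:
    estimate the population size as |A| * |B| / |A intersect B|."""
    overlap = len(set(sample_a) & set(sample_b))
    if overlap == 0:
        raise ValueError("samples do not overlap; cannot estimate")
    return len(sample_a) * len(sample_b) / overlap

# Simulated search-engine index of 100,000 documents; each batch of
# random queries is modelled as drawing a uniform sample of result IDs.
random.seed(42)
index = range(100_000)
sample_a = random.sample(index, 2_000)
sample_b = random.sample(index, 2_000)

print(round(capture_recapture_size(sample_a, sample_b)))
```

With uniform samples the estimate lands near the true size of 100,000; real search-engine results are far from uniform, which is what makes unbiased estimation a genuinely hard problem.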

Complexity of searching XML documents. We proved theoretical lower bounds on the amount of memory required to support search over XML documents, showing that it is significantly harder than search over plain text documents.
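A toy example of why XML search needs more state than text search: a streaming matcher for a descendant query must keep track of the currently open elements, so its memory grows with the document's nesting depth, whereas substring search over flat text runs in constant memory. This is purely an illustrative sketch, not the project's lower-bound construction.

```python
def stream_search_depth(events, tag):
    """Toy streaming matcher for the XPath-like query //tag.
    Scans a stream of (event, name) pairs and returns the number of
    matches plus the peak size of the open-element stack the matcher
    had to maintain."""
    stack, matches, peak = [], 0, 0
    for event, name in events:
        if event == "start":
            stack.append(name)
            peak = max(peak, len(stack))
            if name == tag:
                matches += 1
        else:  # "end" event: close the innermost open element
            stack.pop()
    return matches, peak

# Event stream for <a><b><c><b></b></c></b></a>, nested 4 levels deep.
events = [("start", "a"), ("start", "b"), ("start", "c"), ("start", "b"),
          ("end", "b"), ("end", "c"), ("end", "b"), ("end", "a")]
print(stream_search_depth(events, "b"))  # → (2, 4)
```

The peak stack size equals the nesting depth, which hints at why memory lower bounds for XML search can depend on document structure rather than just document length.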

Ranking in social networks. We presented two ranking algorithms for social networks: one that ranks individuals based on their degree of influence in the network, and another that ranks groups based on how cohesive and tightly knit they are. A paper about the latter algorithm won an honourable mention for the Best Application Paper Award at the International Conference on Data Mining.
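As a rough illustration of influence ranking, a power-iteration PageRank over a "follows" graph scores individuals by the endorsements flowing to them. This is a stand-in sketch only: the project's actual ranking algorithms may work quite differently, and the graph and names below are invented for the example.

```python
def pagerank(graph, damping=0.85, iters=50):
    """Minimal power-iteration PageRank. `graph` maps each node to
    the list of nodes it links to (e.g. people it follows)."""
    nodes = list(graph)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iters):
        new = {v: (1 - damping) / n for v in nodes}
        for v, outs in graph.items():
            if not outs:  # dangling node: spread its mass uniformly
                for u in nodes:
                    new[u] += damping * rank[v] / n
            else:
                for u in outs:
                    new[u] += damping * rank[v] / len(outs)
        rank = new
    return rank

# Hypothetical network: alice and bob both follow carol.
follows = {"alice": ["carol"], "bob": ["carol"], "carol": ["alice"]}
ranks = pagerank(follows)
print(max(ranks, key=ranks.get))  # → carol
```

The node with the most incoming endorsements ends up ranked highest, which is the basic intuition behind influence-based ranking.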

Detection of near-duplicate documents. We developed a highly efficient algorithm that detects near-duplicate web documents by examining only their URLs, without having to inspect their content.
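One way URL-only detection can work is by applying normalisation rules so that URLs pointing at the same content collapse to a single canonical form. The sketch below is a hypothetical illustration, assuming three hand-picked rules (case-folding the host, dropping a trailing `index.html`, and stripping session/tracking query parameters); it is not the project's actual rule set.

```python
from urllib.parse import urlsplit, parse_qsl, urlencode, urlunsplit

def canonicalize(url, ignore_params=("sessionid", "ref")):
    """Map a URL to a canonical form. URLs with the same canonical
    form are flagged as likely near-duplicates without any fetch."""
    scheme, netloc, path, query, _ = urlsplit(url)
    netloc = netloc.lower()                      # case-fold the host
    if path.endswith("/index.html"):             # drop default leaf page
        path = path[: -len("index.html")]
    params = [(k, v) for k, v in parse_qsl(query)
              if k not in ignore_params]         # strip tracking params
    return urlunsplit((scheme, netloc, path, urlencode(sorted(params)), ""))

urls = [
    "http://Example.com/a/index.html?ref=home",
    "http://example.com/a/?",
    "http://example.com/b?x=1",
]
print(canonicalize(urls[0]) == canonicalize(urls[1]))  # → True
```

Bucketing a crawl's URLs by canonical form then yields candidate duplicate sets in a single linear pass, which is what makes content-free detection so cheap.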