Community Research and Development Information Service - CORDIS

Online audio/music databases: Scalable web pages crawler

The result is a prototype system that crawls web pages belonging to a particular topic area (musical artists, music genres, or any other set of items) and then automatically computes similarity relations between those items.
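
The description does not say which similarity measure is used, only that it is built on co-occurrences detected in the crawled pages. As a minimal sketch under that assumption, the following counts how often two items appear on the same page and normalizes the count with the Jaccard coefficient. The function name `cooccurrence_similarity`, the choice of Jaccard, and the toy pages are illustrative assumptions, not the project's method.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(pages, items):
    """Return {(a, b): similarity} for item pairs that share at least one page.

    pages: iterable of sets of lower-cased words, one set per crawled page.
    items: item names (single lower-cased words here, for simplicity).
    """
    item_set = set(items)
    page_count = Counter()  # number of pages mentioning each item
    pair_count = Counter()  # number of pages mentioning both items of a pair

    for words in pages:
        present = sorted(item_set & words)  # sorted so each pair has one key
        page_count.update(present)
        pair_count.update(combinations(present, 2))

    # Jaccard coefficient (an assumed choice): pages with both / pages with either.
    return {
        (a, b): both / (page_count[a] + page_count[b] - both)
        for (a, b), both in pair_count.items()
    }


# Made-up toy data: each set stands for the words kept from one page.
pages = [
    {"coltrane", "davis", "jazz"},
    {"coltrane", "davis", "quintet"},
    {"davis", "trumpet"},
    {"coltrane", "metallica"},
]
sims = cooccurrence_similarity(pages, ["coltrane", "davis", "metallica"])
```

Pairs that never share a page are simply absent from the result, which keeps the output sparse when the item set is large.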

An important characteristic of the system is its personalization: it can run on an ordinary PC and accumulate pages in separate databases, which can then be reused to build different kinds of similarity measures. Specific mechanisms focus the crawl, for example by querying search engines such as Google to obtain meaningful starting points.
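
A focused crawl of this kind could look roughly like the sketch below: it starts from a list of seed URLs (which might come from a search engine query), and only stores and expands pages that mention a topic word. Everything here is an assumption made for illustration: the function `focused_crawl`, the keyword-based relevance test, and the `fetch` callable, which is passed in so the example can run against an in-memory toy "web" instead of real HTTP requests.

```python
from collections import deque


def focused_crawl(seeds, fetch, topic_words, max_pages=100):
    """Breadth-first crawl that only expands pages mentioning a topic word.

    fetch(url) must return (text, links), or None if the page is unavailable.
    Returns {url: text} for the on-topic pages that were visited.
    """
    topic = {w.lower() for w in topic_words}
    frontier = deque(seeds)
    seen = set(seeds)
    kept = {}

    while frontier and len(kept) < max_pages:
        url = frontier.popleft()
        page = fetch(url)
        if page is None:
            continue
        text, links = page
        if not topic & set(text.lower().split()):
            continue  # off-topic: neither stored nor expanded
        kept[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                frontier.append(link)
    return kept


# Toy in-memory "web" standing in for real HTTP fetches (hypothetical data).
WEB = {
    "a": ("jazz saxophone legends", ["b", "c"]),
    "b": ("cooking recipes", ["d"]),
    "c": ("bebop and jazz history", ["a"]),
    "d": ("more jazz", []),
}
result = focused_crawl(["a"], WEB.get, ["jazz"])
```

In this toy run page `d` is on topic but is never reached, because it is only linked from the off-topic page `b`. That is the trade-off of pruning at off-topic pages: the crawl stays small but can miss relevant pages behind irrelevant ones.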

The system has been designed and implemented with a standard database schema that allows it to scale to high volumes of data. Specific data structures have been designed to minimize the amount of information actually stored (web pages are not stored in full, only their meaningful words). The detection of co-occurrences and the computation of similarities have also been optimised.
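
The actual data structures are not described, so the following is only a sketch of the general idea of keeping meaningful words instead of whole pages. It strips markup with Python's standard `html.parser`, tokenizes the visible text, and drops short words and stop words. The stop-word list, the length threshold, and the name `meaningful_words` are assumptions chosen for illustration.

```python
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "on", "for", "with"}
)


class _TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def meaningful_words(html):
    """Reduce an HTML page to the set of words worth storing."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # ASCII-only tokenization for brevity; accented words would need \w.
    tokens = re.findall(r"[a-z0-9]+", " ".join(parser.chunks).lower())
    return frozenset(t for t in tokens if len(t) > 2 and t not in STOP_WORDS)


html = (
    "<html><style>p {color: red}</style>"
    "<body><p>The history of Jazz and the Blues</p></body></html>"
)
words = meaningful_words(html)
```

Storing a set of words per page discards word order and frequency, which is enough for page-level co-occurrence detection but not for measures that depend on how close two items appear in the text.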

An interface allows the user to configure the crawl (e.g. the number of processes running simultaneously) and the root pages, to manage the databases that store web pages, to run the computation of similarities, and to export them to a format understandable by the Music Browser.

Reported by

Sony France S.A.
20-26 Rue Morel
92110 Clichy la Garenne