Community Research and Development Information Service - CORDIS


CUIDADO Report Summary

Project ID: IST-1999-20194
Funded under: FP5-IST
Country: France

Online audio/music databases: Scalable web pages crawler

The result is a prototype system that crawls web pages belonging to a particular topic area (musical artists, music genres, or any other set of items) and then automatically computes similarity relations between those items.

An important characteristic of the system is that it can be personalized. It runs on a simple PC and accumulates pages in separate databases, which can then be reused to build similarity measures of different kinds. Specific mechanisms focus the crawl, for example by using search engines such as Google to obtain meaningful starting points.
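The focusing idea can be illustrated with a minimal sketch. It is not the project's implementation: the function name `focused_crawl`, the `fetch` callback, and the in-memory `TOY_WEB` are all hypothetical, and the seed list stands in for starting points that would in practice come from a search engine. The sketch only shows one plausible rule, namely that pages not mentioning any topic term are neither stored nor expanded.

```python
from collections import deque

def focused_crawl(seeds, fetch, topic_terms, max_pages=100):
    """Breadth-first crawl that only keeps and expands on-topic pages.

    fetch(url) is a hypothetical stand-in for HTTP fetching and HTML
    parsing; it must return (text, links) or None if the page is missing.
    """
    terms = {t.lower() for t in topic_terms}
    queue = deque(seeds)
    seen = set(seeds)
    kept = {}
    while queue and len(kept) < max_pages:
        url = queue.popleft()
        result = fetch(url)
        if result is None:
            continue
        text, links = result
        lowered = text.lower()
        if not any(term in lowered for term in terms):
            continue  # off-topic: do not store the page or follow its links
        kept[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept

# Toy in-memory "web" (assumption for illustration): url -> (text, links)
TOY_WEB = {
    "a": ("The Beatles and The Rolling Stones", ["b", "c"]),
    "b": ("Cooking recipes", ["d"]),
    "c": ("Rolling Stones tour dates", ["d"]),
    "d": ("The Beatles discography", []),
}

pages = focused_crawl(["a"], TOY_WEB.get, ["Beatles", "Rolling Stones"])
```

Here page "b" is dropped because it mentions no topic term, while "d" is still reached through the on-topic page "c".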

The system has been designed and implemented with a standard database schema that allows it to scale to high volumes of data. Specific data structures have been designed to minimize the amount of information actually stored (web pages are not stored in full, only their meaningful words). The detection of co-occurrences and the computation of similarities have also been optimised.
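As a rough illustration of the two ideas in this paragraph, the sketch below reduces each page to a set of words and derives item similarities from page-level co-occurrence. Everything in it is an assumption made for the example: the summary does not give the actual data structures, the stopword list, or the similarity formula, so a tiny stopword set and a Jaccard-style ratio are used as stand-ins. An item counts as present on a page when all of its words appear there, which ignores word order.

```python
import re
from collections import Counter
from itertools import combinations

STOPWORDS = {"the", "and", "of", "a", "in", "to", "is"}  # illustrative only

def meaningful_words(text):
    """Reduce a text to its set of lowercase, non-stopword tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if t not in STOPWORDS}

def item_similarities(page_words, items):
    """Jaccard-style similarity from page co-occurrence (assumed measure).

    page_words: iterable of word sets, one per stored page.
    Returns {(item_a, item_b): score} for pairs that co-occur at least once.
    """
    item_tokens = {item: meaningful_words(item) for item in items}
    page_count = Counter()   # number of pages mentioning each item
    pair_count = Counter()   # number of pages mentioning both items of a pair
    for words in page_words:
        present = sorted(
            item for item, toks in item_tokens.items() if toks and toks <= words
        )
        page_count.update(present)
        pair_count.update(combinations(present, 2))
    return {
        (a, b): both / (page_count[a] + page_count[b] - both)
        for (a, b), both in pair_count.items()
    }

raw_pages = [
    "The Beatles and the Rolling Stones in London",
    "Beatles discography",
    "Rolling Stones and Beatles tour",
    "Miles Davis in Paris",
]
stored = [meaningful_words(p) for p in raw_pages]  # only word sets are kept
sims = item_similarities(stored, ["Beatles", "Rolling Stones", "Miles Davis"])
```

In this toy data the two rock bands share two of the three pages that mention either of them, while "Miles Davis" never co-occurs with another item and therefore gets no similarity entry.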

An interface allows the user to configure the crawl (e.g. the number of processes running simultaneously) and the root pages, and to manage the databases that store web pages, the computation of similarities, and the export of those similarities to a format understandable by the Music Browser.
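The settings listed above could be grouped as in the following sketch. The class `CrawlConfig`, its field names, and the CSV layout are hypothetical: the summary does not describe the real interface or the format expected by the Music Browser, so CSV is used purely as a placeholder export.

```python
import csv
import io
from dataclasses import dataclass

@dataclass
class CrawlConfig:
    """Hypothetical bundle of user-facing crawl settings."""
    root_pages: list
    num_processes: int = 4            # crawler processes run simultaneously
    database_name: str = "crawl.db"   # where reduced pages would be stored

    def __post_init__(self):
        if self.num_processes < 1:
            raise ValueError("num_processes must be at least 1")
        if not self.root_pages:
            raise ValueError("at least one root page is required")

def export_similarities(sims):
    """Serialize {(item_a, item_b): score} as CSV text, one pair per row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["item_a", "item_b", "similarity"])
    for (a, b), score in sorted(sims.items()):
        writer.writerow([a, b, f"{score:.3f}"])
    return buffer.getvalue()

config = CrawlConfig(root_pages=["http://example.org/artists"], num_processes=2)
csv_text = export_similarities({("Beatles", "Rolling Stones"): 2 / 3})
```

Validating the settings when the object is created keeps an invalid crawl (no root pages, zero processes) from starting at all.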


Francois PACHET (Head of Music Unit)
Tel.: +33-1-44080505
Fax: +33-1-45878750