Community Research and Development Information Service - CORDIS

FP5

CUIDADO Report Summary

Project ID: IST-1999-20194
Funded under: FP5-IST
Country: France

Online audio/music databases: Scalable web pages crawler

The result is a prototype system that crawls web pages belonging to a particular topic area (musical artists, music genres, or any other set of items) and then automatically computes similarity relations between those items.
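As an illustration of how similarity relations might be derived from crawled pages, the sketch below counts how often two items are mentioned on the same page and normalises the count with a Jaccard index. The choice of Jaccard, the substring matching, and the name `cooccurrence_similarity` are assumptions made for this example; the summary does not state which measure the prototype uses.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_similarity(pages, items):
    """Return {(item_a, item_b): score} for item pairs seen on a common page.

    pages: iterable of page texts.
    items: list of item names (e.g. artist names).
    The score is a Jaccard index over pages (an assumption, not the
    measure documented for the prototype).
    """
    page_count = Counter()   # number of pages mentioning each item
    pair_count = Counter()   # number of pages mentioning both items of a pair
    for text in pages:
        lowered = text.lower()
        # Naive case-insensitive substring match; sorted so pair keys are stable.
        present = sorted({name for name in items if name.lower() in lowered})
        page_count.update(present)
        pair_count.update(combinations(present, 2))

    similarities = {}
    for (a, b), both in pair_count.items():
        union = page_count[a] + page_count[b] - both
        similarities[(a, b)] = both / union
    return similarities
```

Pairs that never share a page are simply absent from the result, which keeps the output small when most items are unrelated.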

An important characteristic of the system is its personalization: it can run on a single PC and accumulate pages in separate databases, which can then be reused to build similarity measures of different kinds. Specific mechanisms keep the crawl focused, for example by querying a search engine such as Google to obtain meaningful starting points.
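A focused crawl of this kind can be sketched as a breadth-first traversal that starts from seed URLs (for instance, search-engine results) and only stores and expands pages that mention at least one item of interest. Everything here is hypothetical: the function name `focused_crawl`, the `fetch` callable (which stands in for real HTTP retrieval and link extraction), and the relevance test are assumptions for illustration, not the prototype's actual mechanism.

```python
from collections import deque


def focused_crawl(seeds, fetch, items, max_pages=100):
    """Breadth-first crawl that keeps only on-topic pages.

    seeds: starting URLs (e.g. taken from search-engine results).
    fetch: hypothetical callable, url -> (text, links) or None on failure.
    items: item names; a page is on-topic if it mentions any of them.
    Returns {url: text} for the pages that were kept.
    """
    seeds = list(dict.fromkeys(seeds))  # drop duplicate seeds, keep order
    queue = deque(seeds)
    seen = set(seeds)
    kept = {}
    fetched = 0
    while queue and fetched < max_pages:
        url = queue.popleft()
        result = fetch(url)
        fetched += 1
        if result is None:
            continue
        text, links = result
        lowered = text.lower()
        if not any(name.lower() in lowered for name in items):
            continue  # off-topic: neither stored nor expanded
        kept[url] = text
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return kept
```

Because off-topic pages are not expanded, the crawl stays close to the seed topic at the cost of missing relevant pages that are reachable only through irrelevant ones.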

The system has been designed and implemented with a standard database schema that allows it to scale to high volumes of data. Specific data structures have been designed to minimize the amount of information actually stored (web pages are not stored in full; only meaningful words are kept). The detection of co-occurrences and the computation of similarities have also been optimised.
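The idea of storing only meaningful words rather than whole pages can be sketched with a small relational layout. The table names, the stop-word list, the rule that a meaningful word has more than two characters, and the use of SQLite are all assumptions made for this example; the summary does not describe the actual schema or word filter.

```python
import re
import sqlite3

# Tiny illustrative stop-word list (an assumption, not the prototype's list).
STOP_WORDS = {"the", "and", "for", "with", "that", "this", "from"}


def meaningful_words(text):
    """Return the set of lowercase tokens kept for a page."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {t for t in tokens if len(t) > 2 and t not in STOP_WORDS}


def open_store(path=":memory:"):
    """Create the hypothetical schema: pages, words, and a link table."""
    db = sqlite3.connect(path)
    db.executescript("""
        CREATE TABLE IF NOT EXISTS pages(id INTEGER PRIMARY KEY, url TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS words(id INTEGER PRIMARY KEY, word TEXT UNIQUE);
        CREATE TABLE IF NOT EXISTS page_words(
            page_id INTEGER, word_id INTEGER,
            PRIMARY KEY (page_id, word_id));
    """)
    return db


def store_page(db, url, text):
    """Store a page as a set of word ids; the raw text is discarded."""
    db.execute("INSERT OR IGNORE INTO pages(url) VALUES (?)", (url,))
    page_id = db.execute(
        "SELECT id FROM pages WHERE url = ?", (url,)).fetchone()[0]
    for word in meaningful_words(text):
        db.execute("INSERT OR IGNORE INTO words(word) VALUES (?)", (word,))
        word_id = db.execute(
            "SELECT id FROM words WHERE word = ?", (word,)).fetchone()[0]
        db.execute(
            "INSERT OR IGNORE INTO page_words VALUES (?, ?)", (page_id, word_id))
    db.commit()


def pages_containing(db, word):
    """Return the URLs of stored pages that contain the given word."""
    rows = db.execute(
        """SELECT p.url FROM pages p
           JOIN page_words pw ON pw.page_id = p.id
           JOIN words w ON w.id = pw.word_id
           WHERE w.word = ? ORDER BY p.url""", (word,))
    return [row[0] for row in rows]
```

Each distinct word is stored once and pages reference it by id, so the per-page cost is a set of integer pairs rather than the full text.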

An interface lets the user configure the crawl (e.g. the number of processes running simultaneously) and the root pages, manage the databases that store web pages, run the computation of similarities, and export the results to a format understandable by the Music Browser.
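The settings listed above could be grouped into a small configuration object, with a separate export step for the computed similarities. This is only a sketch: the class `CrawlConfig`, its field names and defaults, and the CSV layout are hypothetical. The Music Browser's real import format is not described in this summary, so CSV is used purely as a stand-in.

```python
import csv
from dataclasses import dataclass


@dataclass
class CrawlConfig:
    """Hypothetical crawl settings mirroring the options named in the text."""
    root_pages: list
    num_processes: int = 4            # crawler processes running simultaneously
    database_path: str = "crawl.db"   # which page database to fill or reuse

    def __post_init__(self):
        if not self.root_pages:
            raise ValueError("at least one root page is required")
        if self.num_processes < 1:
            raise ValueError("num_processes must be >= 1")


def export_similarities(similarities, out):
    """Write {(item_a, item_b): score} as CSV rows to a text stream.

    CSV is an assumed stand-in for the Music Browser format.
    """
    writer = csv.writer(out)
    writer.writerow(["item_a", "item_b", "similarity"])
    for (a, b), score in sorted(similarities.items()):
        writer.writerow([a, b, f"{score:.4f}"])
```

Keeping one database path per configuration matches the idea of accumulating pages in separate databases that can be reused for different similarity measures.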

Contact

Francois PACHET (Head of Music Unit)
Tel.: +33-1-44080505
Fax: +33-1-45878750