Community Research and Development Information Service - CORDIS

Online audio/music databases: Extractor authoring tool

The rapidly growing field of Music Information Retrieval has put new pressure on the audio signal processing community to extract high-level music descriptors automatically. Current systems offer users millions of music titles (e.g. peer-to-peer systems such as Kazaa), with query functions usually limited to string matching on title names. The natural extension of these systems is content-based access, i.e. the ability to access music titles based on their actual content rather than on file names. Existing systems today rely mostly on editorial information (e.g. Kazaa) or on metadata that is entered manually, either by pools of experts (e.g. All Music Guide) or collaboratively (e.g. MoodLogic). Because these methods are costly and do not scale, automatically extracting high-level features from acoustic signals is key to the success of online music access systems.

Although there is a long tradition of extracting information from acoustic signals, the field of music information extraction is largely heuristic in nature. We have built a heuristic-based system that automatically extracts high-level music descriptors from acoustic signals. The approach uses Genetic Programming to build extraction functions as compositions of basic mathematical and signal processing operators. The search is guided by specialized heuristics that embody knowledge about the signal processing functions built by the system, and signal-processing patterns are used to control the general function extraction methods.

Our system, called EDS (Extractor Discovery System), automatically produces relevant extractors for audio descriptors and handles both regression and supervised classification problems.

EDS takes as input a database of audio signals, each labelled with its actual description value (typically a normalised numeric value for a regression problem, or a class number for a classification problem). As output, EDS provides an optimal regression (or classification) model for the description problem considered, together with an executable function that uses this model to predict the description value (or class) of any new signal. Running EDS consists of two parts:

First, EDS runs a genetic algorithm that automatically builds a population of signal processing functions out of signal and mathematical operators. EDS then evaluates how relevant each function in the population is to solving the description problem on the input database, and tries to improve the functions by applying genetic transformations to them, such as mutations, insertions, deletions, crossovers, or variations of numeric constants, to build a new population of functions (the next generation). This process is repeated over successive generations until a perfect function is found or the search is stopped manually.
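The genetic search described above can be illustrated with a minimal sketch. It is not the EDS implementation: the operator library, the chain representation of a function, the use of absolute Pearson correlation as the relevance score, and all names (TRANSFORMS, REDUCERS, MAX_LEN, evolve, and so on) are assumptions made for illustration. Variations of numeric constants are left out because the toy operators have no parameters.

```python
import math
import random

MAX_LEN = 4  # assumed cap on the number of chained transforms

# Hypothetical operator library: signal -> signal transforms, signal -> scalar reducers.
TRANSFORMS = {
    "abs": lambda s: [abs(x) for x in s],
    "square": lambda s: [x * x for x in s],
    "diff": lambda s: [b - a for a, b in zip(s, s[1:])] or [0.0],
}
REDUCERS = {
    "mean": lambda s: sum(s) / len(s),
    "max": max,
    "rms": lambda s: math.sqrt(sum(x * x for x in s) / len(s)),
}


def apply_extractor(extractor, signal):
    """Apply the chain of transforms, then reduce the result to one number."""
    transforms, reducer = extractor
    for name in transforms:
        signal = TRANSFORMS[name](signal)
    return REDUCERS[reducer](signal)


def fitness(extractor, signals, labels):
    """Absolute Pearson correlation between extractor outputs and labels."""
    values = [apply_extractor(extractor, s) for s in signals]
    n = len(values)
    mv, ml = sum(values) / n, sum(labels) / n
    svl = sum((v - mv) * (l - ml) for v, l in zip(values, labels))
    svv = sum((v - mv) ** 2 for v in values)
    sll = sum((l - ml) ** 2 for l in labels)
    if svv == 0 or sll == 0:
        return 0.0  # a constant output carries no information
    r = svl / math.sqrt(svv * sll)
    return abs(r) if math.isfinite(r) else 0.0


def random_extractor(rng):
    length = rng.randint(0, MAX_LEN)
    chain = tuple(rng.choice(list(TRANSFORMS)) for _ in range(length))
    return (chain, rng.choice(list(REDUCERS)))


def mutate(extractor, rng):
    """Insert, delete or replace one transform, or swap the reducer."""
    transforms, reducer = list(extractor[0]), extractor[1]
    op = rng.choice(["insert", "delete", "replace", "reducer"])
    if op == "insert" and len(transforms) < MAX_LEN:
        transforms.insert(rng.randint(0, len(transforms)), rng.choice(list(TRANSFORMS)))
    elif op == "delete" and transforms:
        del transforms[rng.randrange(len(transforms))]
    elif op == "replace" and transforms:
        transforms[rng.randrange(len(transforms))] = rng.choice(list(TRANSFORMS))
    else:
        reducer = rng.choice(list(REDUCERS))
    return (tuple(transforms), reducer)


def crossover(a, b, rng):
    """Join the head of one parent's chain to the tail of the other's."""
    head = a[0][: rng.randint(0, len(a[0]))]
    tail = b[0][rng.randint(0, len(b[0])):]
    return ((head + tail)[:MAX_LEN], rng.choice([a[1], b[1]]))


def evolve(signals, labels, generations=20, size=20, seed=0):
    """Keep the better half of each generation and breed the rest from it."""
    rng = random.Random(seed)
    score = lambda e: fitness(e, signals, labels)
    population = [random_extractor(rng) for _ in range(size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        if score(population[0]) >= 0.999999:  # "perfect" function found
            break
        parents = population[: size // 2]
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(size - len(parents))
        ]
        population = parents + children
    return max(population, key=score)
```

Calling evolve(signals, labels) on a list of sample lists and their numeric labels returns the best (transforms, reducer) pair found. The fixed generation count stands in for stopping the search manually.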

Second, EDS builds the descriptive model by selecting the most relevant features found in the first part and finding the optimal combination of these features, i.e. the combination whose results are closest to the actual description values in the input database. Finally, EDS provides an executable function that computes the descriptive model on any audio signal and saves the predicted result in a text file.
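One way to picture this second part is to rank the features by relevance, keep the best few, and fit a weighted combination of them to the labelled values. The sketch below does this with correlation ranking and ordinary least squares. Both choices are assumptions, since the text does not say which selection or combination method EDS uses, and the function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return 0.0 if sxx == 0 or syy == 0 else sxy / math.sqrt(sxx * syy)


def select_features(columns, targets, k):
    """Indices of the k feature columns most correlated with the targets."""
    ranked = sorted(
        range(len(columns)),
        key=lambda j: abs(pearson(columns[j], targets)),
        reverse=True,
    )
    return ranked[:k]


def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        rest = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - rest) / m[r][r]
    return x


def fit_linear(columns, targets):
    """Least-squares weights, intercept first, via the normal equations.

    Assumes the selected features are not collinear; otherwise the
    system is singular and the division in solve() fails.
    """
    design = [[1.0] + [col[i] for col in columns] for i in range(len(targets))]
    p = len(design[0])
    ata = [[sum(row[i] * row[j] for row in design) for j in range(p)] for i in range(p)]
    atb = [sum(row[i] * t for row, t in zip(design, targets)) for i in range(p)]
    return solve(ata, atb)


def predict(weights, features):
    """Apply the fitted model to the selected feature values of one signal."""
    return weights[0] + sum(w * f for w, f in zip(weights[1:], features))
```

Here each entry of columns holds one feature's values over the whole database. In practice the fit would be judged on extracts held out from training, a step this sketch omits.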

An example of a regression problem solved by EDS is the evaluation of the Global Intensity of music titles, that is, the subjective impression of energy that music titles convey independently of the RMS volume level: at the same volume, a hard-rock title conveys more intensity than an acoustic guitar ballad with a soft voice. The input database consists of 200 musical extracts, together with their "Intensity" values, which were evaluated statistically in earlier perceptual tests. After running, EDS provided a regression model of "Intensity" with an error of 11%, which is close to the statistical error of the perceptual test. The associated executable takes a WAV file as input, applies the model to it, and writes the predicted Intensity value to a text file.
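The behaviour of such an executable (read a WAV file, apply a model, write the prediction to a text file) might look roughly like the following sketch. It uses only Python's standard wave and struct modules. The function names are hypothetical, and the extractor and model are passed in as parameters because the actual Intensity model is not given here.

```python
import struct
import wave


def read_wav_mono(path):
    """Read a 16-bit PCM WAV file as samples in [-1, 1], averaging the channels."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        frames = wav.readframes(wav.getnframes())
    values = struct.unpack("<%dh" % (len(frames) // 2), frames)
    return [
        sum(values[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(values), channels)
    ]


def run_descriptor(wav_path, out_path, extractor, model):
    """Compute extractor(signal), pass it through model, and save the result.

    extractor: function from a list of samples to a feature value (assumed).
    model: function from that feature value to the predicted description (assumed).
    """
    signal = read_wav_mono(wav_path)
    value = model(extractor(signal))
    with open(out_path, "w") as out:
        out.write("%s\n" % value)
    return value
```

For a classification problem the same wrapper would apply, with a model that returns a class name instead of a number.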

An example of a classification problem solved by EDS is the detection of singing voice in polyphonic music. The input database consists of 200 musical extracts, 100 of which are sung and 100 instrumental. After running, EDS provided a classification model of "Singing Voice" with 85% correct classifications. The associated executable takes a WAV file as input, applies the model to it, and writes the predicted class (Sung or Instrumental) to a text file.
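To make the "correct classifications" figure concrete, the sketch below measures accuracy for a deliberately simple two-class model: a single threshold on one feature value. The threshold classifier and the encoding of the classes as 1 (sung) and 0 (instrumental) are assumptions for illustration, not a description of the model EDS produced.

```python
def accuracy(predicted, actual):
    """Fraction of extracts whose predicted class matches the label."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def best_threshold(values, labels):
    """Find the cut on one feature that maximises accuracy on the labelled set.

    Values above the cut are predicted as class 1, the others as class 0.
    Returns (None, -1.0) when all feature values are identical.
    """
    candidates = sorted(set(values))
    best_cut, best_acc = None, -1.0
    for lo, hi in zip(candidates, candidates[1:]):
        cut = (lo + hi) / 2
        acc = accuracy([int(v > cut) for v in values], labels)
        if acc > best_acc:
            best_cut, best_acc = cut, acc
    return best_cut, best_acc
```

Accuracy measured on the same extracts used to pick the threshold is optimistic; a held-out set gives a fairer estimate.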

EDS can be used to compute descriptors automatically on a large database from a small set of hand-labelled music titles. Integrated into a music browser, EDS would allow users to specify their own relevant descriptors and compute them automatically on their music collection.

Reported by

Sony CSL
6 rue Amyot
75005 Paris