Content-based Unified Interfaces and Descriptors for Audio/music Databases available Online

Over the last years there has been considerable interest in multimedia systems that allow interactive involvement of the user int he watching or listening process. Idea such as letting the user to choose the angle of the camera dating broadcasting of live sport event, have appeared in Interactive TV applications. Today, advanced digital music players offer only limited digital effects that allow modifying certain global sound characteristics (mostly room or sound coloration effects) to match the user's listening preferences. Providing the possibility for manipulation a recording in a musically meaningful ways opens new possibilities in music entertainment, education and authoring applications. Our application, which we call 'a Virtual Mixer' allows a separate, detailed control over the volume of every single instrument in a recording. Normally, the balance between the instruments is done by the recording engineers during a Multi Track recording. Allowing the user to change the mixing parameters offers a greater involvement in the musical content and creates n enriched and active listening experience. For example, when a user listens to a musical piece that contains several instruments, he might wish to adjust the balance between the individual instruments. Furthermore, he might wish to isolate few instruments from the rest. This is done by attaching a meta-data of precise time-pitch occurrences to musical recording. The VM system consists of two main components: a robust algorithm for score alignment of polyphonic sounds and an adaptive, high resolution filtering process.

A set of extractors of music content descriptors that can be used with songs, loops or monophonic phrases. Extraction is done at the melodic-harmonic, rhythmic, and instrumental layers. Morphological description for sound samples (i.e. isolated sounds like sound effects or instrument notes) is also included. Music material is characterized in terms of tick, beat, metric, rhythmic complexity, rhythmicity, and harmonicity. The extracted descriptors are usable in music databases for expanding the retrieval capabilities towards semantic functionalities.

Automatic audio and visual summary generation of a music piece from signal analysis. Automatic Music summary generation brings two new dimensions to users for fast music browsing: - A visual representation presenting clearly the different part of music piece (for instance phrase and verse) and the parts, which are similar. - A sound "remix" which generate a short audio summary (for instance 10 seconds for a 3 minutes title), which is not just the addition of the most recurrent parts but a real combination of them giving a "global view" of the piece. This technology is expected to bring new usage not only in term of music consumption (faster browsing before purchase) but also in terms of new personal publishing and creative tools.

The result is a database layer for management of XML based, MPEG-7 audio descriptors and descriptions. This layer has been developed on top of Oracle's XML database (XDB) and extends the native XML and object-relational functionalities to natively cover all MPEG-7 specific aspects that have been identified within the CUIDADO project; particularly term management, reference management, relationship management, indexing and consistency management.

A natural way for searching a musical audio database for a song is to look for a short audio segment containing a melody from the song. Most of the existing systems are based on textual information, such as the title of the song and the name of the compose. However, people often do not remember the name of the compose and the song's title but can easily recall fragments from the soloist's melody. The task of 'query by melody' attempts to automate the music retrieval task. Given a sounds databases with hundreds or thousands of sounds, the user can submit a melody query to the application. The query is a symbolic representation of the sound (MIDI format). The user then gets as a result the list of sounds, which best fits, his query, ranked by their similarity to the query. Previous works on QBM focused on searching in MIDI database, or on monophonic recordings. Our application does not have this limitation. We search on real sounds database, which contains polyphonic sounds. When dealing with real polyphonic recordings we need to address several complication factors. Ideally, melodies can be represented as sequences fo notes, each is a pair of frequency and temporal duration. In real recordings two major sources of difficulty arise. The first is the high variability of the actual durations of notes. A melody can be performed faster or slower than the one dictated by the musical score (tempo variability). Furthermore, the tempo can vary within a single performance. The second complicating factor is the high variability of the spectrum due to many factors such as differences in tone colours (timer) of different singers/instruments. We solved these problems using state of the art algorithms. We use a generative probabilistic approach that models the temporal and spectral variations. This work was presented at SIGIR2002 " Shwartz, Dubnov, Singer and Friedman. - A robust Temporal and Spectral Modelling for Query by Melody"

The CUIDADO project aimed at developing and experimenting innovative products and services relying on a comprehensive music/audio content-based approach. This result concerns the audio identification through a highly compressed audio fingerprint. This patented algorithm has been successfully tested in the CUIDADO project on a database of several thousand of music titles of different genres and styles. Compared to others audio fingerprint technologies it is considerably more compressed and thus is allowing faster retrieval of large databases.

The exploding field of Music Information Retrieval has recently created extra pressure to the community of audio signal processing, for extracting automatically high level music descriptors. Indeed, current systems propose users with millions of music titles (e.g. the peer-to-peer systems such as Kazaa) and query functions limited usually to string matching on title names. The natural extension of these systems is content-based access, i.e. the possibility to access music titles based on their actual content, rather than on file names. Existing systems today are mostly based on editorial information (e.g. Kazaa), or metadata, which is entered manually, either by, pools of experts (e.g. All Music Guide) or in a collaborative manner (e.g. the MoodLogic). Because these methods are costly and do not allow scale up, the issue of extracting automatically high-level features from the acoustic signals is key to the success of online music access systems. Although there is a long tradition in extracting information from acoustic signals, the field of music information extraction is largely heuristic in nature. We have built a heuristic-based system for extracting automatically high-level music descriptors from acoustic signals. This approach is based on Genetic Programming that is used to build extraction functions as compositions of basic mathematical and signal processing operators. The search is guided by specialized heuristics that embody knowledge about the signal processing functions built by the system, and signal-processing patterns are used in order to control the general function extraction methods. Our system called EDS (for Extractor Discovery System) is able to provide automatically relevant extractors for audio descriptors, and to handle both regression and supervised classification problems. EDS takes as input a database of audio signals, labelled with their actual description value (typically a normalised numeric value for a regression problem, or class number for a classification problem). EDS provides as output an optimal regression [classification] model for the considered description problem, together with an executable function that predicts the description value [class] of any new signal, using this model. Running EDS consists in two parts: Firstly, EDS runs a genetic algorithm that builds automatically a population of signal processing functions, out of signal and mathematical operators. Then EDS evaluates if the functions of the population are relevant to help solving the descriptive problem on the input database, and tries to improve the functions by applying genetic transformations on them, such as mutations, insertions, deletions, crossovers, or variations of numeric constants, to build a new population of functions (next generation). This process is applied on all successive generations of functions until a perfect function is found, or the research is stopped manually. Secondly, EDS builds the descriptive model, by selecting the most relevant features found in first part, and finding the optimal combination of these features, ie the combination that provides the closest results to the actual descriptive values of the input database. Finally, EDS provides an executable function that computes the descriptive model on any audio signal, and saves the predictive result in a text file. An example of regressive problem solved by EDS is the evaluation of the Global Intensity of Music Titles, that is the subjective impression of energy that music titles convey, independently of the RMS volume level: with the same volume, a Hard-rock music title conveys more intensity than an acoustic guitar ballad with a soft voice. The input database consists in 200 musical extracts, together with their "Intensity", that has been statistically evaluated during previous perceptive tests. After running EDS, the system finally provided a regressive model of "Intensity" with an error of 11%, which is close to the statistical error of the perceptive test. The associated executable takes a wav file as input, applies the model on it, and writes its predicted Intensity value in a text file. An example of classification problem solved by EDS is the detection of Singing Voice in Polyphonic Music. The input database consists in 200 musical extracts, 100 of which are sung and 100 instrumental. After running EDS, the system finally provided a classification model of "Singing Voice" with a performance of 85% of good classifications. The associated executable takes a wav file as input, applies the model on it, and writes its predicted class (Sung or Instrumental) in a text file. EDS can be used for the automatic computation of descriptors on a large database using a small set of hand-labelled music titles. Integrated in a music browser, EDS would allow the users to specify their own relevant descriptors and compute them automatically on their music collection.

The Sound Palette Offline application offers recording, editing & mixing of existing sound samples and also contributes various signal processing features like analysis, transformation and synthesis functions and also Time stretching (compression and expansion without frequency shifting), Transposition (frequency shifting without time modification) and Filtering & morphing of sounds and audio content. Integrated part of the application will be an offline database browser based on the Oracle Database Engine to allow access to sample based sound libraries and to edit them on a PC based computer system. The Sound Palette Offline has a graphical user interface where different sound samples can be arranged separately plus various windows to perform the sound processing modules with generated display data for highly ergonomic handling of processed data. The included sound processing features can be used within music production studio environment to create new musical items, modify existing audio material or show results on similarity queries for further editing.

This tool combines music descriptions, signal processing tools (time-stretch and pitch transposition), and high-level control structures in order to change the musical content. The tool can be included in audio editors and multitracking systems in order to achieve in real time, transformation of original recording according two dimensions: pitch and time. As long as using the usual control parameters for this type of signal processing devices, other higher-level (i.e. "semantic") controls have been incorporated. This means, for example, that the length of an audio segment can be changed by dragging a graphical handle in order to make it fit a new temporal slot. Melodic changes (i.e. adding or deleting notes, changing tonality, etc.) can be achieved also with high-level controls.

The CUIDADO project aimed at developing and experimenting innovative products and services relying on a comprehensive music/audio content-based approach. The Online Sound Palette is conceived as an online system providing access, either in intranet or over the web, to large, centralized databases of sound samples. In order to provide efficient browsing and searching features among these data, the application heavily relies on descriptions of the audio material; these descriptions mostly follow the Mpeg7 formalisation. They include structured textual (taxonomic) descriptions, as well as numerical information automatically processed from the signal. Innovative functionalities such as automatic classification of samples, or similarity search, are developed on top of these descriptions of the audio content. Taking advantage of the intranet/Internet technology it relies on, the Online Sound Palette also aims at being a tool for online, collaborative work: users can share their own audio samples, on the basis of user groups and access rights definitions.

Automatic audio samples classification system based on in a finite class set trained reference collection such as the instrumental sounds Ircam Studio Online database. Database owners can implement this software in their system or products such as: - Online or local classification server sharable with authorized users. - Preparation of sound catalogues or sound collections. - Software for sound editing, classification and retrieval.

Search by similarity in sound effects is needed for musical authoring and search by content (MPEG-7) applications. Sounds such as outdoor ambience, machine noises, speech or musical excerpts as well as many other man made sound effects (so called "Folley sounds") are complex signals that have a well-perceived acoustic characteristic of some random nature. In many cases, these signals cannot be sufficiently represented based on second order statistics only and require higher order statistics for their characterization. Several methods for statistical modelling of such sounds were proposed in the literature: non-gausian linear and non-linear source-filter models using HOS, optimal basis / sparse geometrical representations using Independent Component Analysis (ICA) and methods that combine ICA-based features with temporal modelling (HMM). Our application is based on a novel algorithm, which uses ICA to compute different sound models for different sound classes. We use the Mutual Information between the sound sources to classify sounds within the models. Another application is search by similarity applications for SFX databases. We use a variat of the classification algorithm to retrieve similar sound effects from SFX database.

The CUIDADO project aimed at developing and experimenting innovative products and services relying on a comprehensive music/audio content-based approach. This result address the problem of building automatically sequences of musical items, such as music titles or sound samples, taken from a large database. Indeed, building manually interesting playlists (sequences of music titles) out of large catalogues of songs, requires a expert musical knowledge of the database that most users do not have. Similarly, musical composition using sound samples requires an expert knowledge and musical sense of the sounds at your disposal. We have implemented a general sequencing algorithm, which allows building these sequences automatically, in a computer-assisted way (see D2.5.3). This can handle both problems (items retrieval and sequence generation) at the same time: the user specifies the sequence properties and the system is able to build automatically the sequence out of musical items taken from very large databases. Indeed, the system we propose is able to scale up on databases containing more than 100.000 items, using a local search method based on constraint solving, called adaptive search. The way the sequence is built is controlled by providing high-level properties, set by the user. The properties of the sequence are translated automatically into constraints holding on descriptors of the audio items. These constraints can be generic, such as all items in the sequence should be different, or can hold on properties of specific attributes of the items called metadata, such as genre or duration for music titles, and pitch or percussivity for sound samples. They can be local, holding on a specific item in the sequence, or can be global, holding on a part or even on the whole sequence. This formulation yields a hard combinatorial problem, as soon as the size of the database gets very large. Thus filtering procedures have to be designed to find solutions in a reasonable time. We have implemented a constraint satisfaction method based on a local search technique, called adaptive search that is able to handle arbitrary complex constraints, and to scale up to large databases. It is an incomplete search algorithm that can yield approximate solutions very efficiently, and that is proven to be well adapted for musical applications. The 2 specific applications of this general sequencing algorithm are: - For sound samples: the musical mosaicing (Musaicing) of the Sound Palette. - For music titles: the playlists generator of the Music Browser.

To promote universality and accessibility of metadata we have designed a tool to design and exploit editorial metadata over the Internet. More precisely, an editorial metadata server has been set up and is useable with a simple query system. This server provides editorial metadata for artists and songs based on textual data gathered from file names, ID3 tag from mp3 files, duration of songs and manually provided information. Once a song is recognized by the server, its ID in the Cuidado database is returned. This ID can be used for several purposes including querying the server again to retrieve metadata about the recognized artist/song. The server also accepts incoming information and store them for further use. We have designed and implemented two main tools to this aim: - A php server; - A servlet server. Both servers use free middleware components -EasyPhp and Tomcat. These components allow the design of a flexible architecture in which the Cuidado server (both PhP and servlet) stores all information computed (descriptors), manually entered (editorial data) or automatically gathered (mp3 ID tags). Users do not need to store locally large amount of metadata. If computed data are available, the MusicBrowser retrieves them automatically when a song is added to the database and costly computation (distance matrix for example) are made on the Cuidado server once for all. Inputs and outputs are all based on text. Here under 2 examples of queries: Servlet: http://TheSonyServer:8080/ClosestSongs?TitleName= mannix&ArtistName= Schifrin&Duration= 90 Php: http://TheSonyServer/allArtistProp.php? id=1354. The arguments used depend on the nature of the query. Moreover, for security reasons, we cannot reveal the exact address of the server, as well as, the exact list of arguments useable for queries. Answers are text files parsed by the Music Browser. Here under two lines of answers coming from the servlet server: 17360 SCHIFRIN, Lalo mannix 3.0 2545 CREOLE, Kid & COCONUTS, The Annie, I'm Not Your Daddy 5.0 This kind of services could be available either through a complete software suite (like Music Browser) or directly with simple subscriptions. The target audiences include music lovers, broadband access subscribers, mp3 fans, music networks and on-line sound libraries.

- Description Scheme in MPEG7 XML Schema for audio timbre (instruments, effects) independent of pitch, intensity and rhythm - Related extraction software in Matlab. Dissemination: - MPEG7 industry forum. - MPEG7 patent consortium. - CUIDADO and other IST projects. - Music fairs. Innovative features: - Automatic extraction. - High accuracy in sound description. Current status: - Implemented in IRCAM Sound Palette. - Patent and software licenses available.

The CUIDADO project aimed at developing and experimenting innovative products and services relying on a comprehensive music/audio content-based approach. The purpose of this result is dissemination. In itself it has neither commercial viability nor any concrete scientific impact. The main idea is to build a simple guide to promote the use of interactive content-based Search & Retrieval tools, of which CUIDADO is in a leading position. This guide will therefore act as means to incentivise users from the general public as well as the "creative communities" OF content creators and producers to understand and use clear and open standards - such as MPEg7 and the W3C Semantic Web - when using and digitising content. The current status (June '03) is that we are waiting for the final results from the technical partners before writing this guide, which will be in the form of a CDrom and flyers.

Today, a wide range of music pieces can be found in a symbolic representation and it is available both on commercial CD's and some of it is available freely over the Internet. One example of symbolic representation is the MIDI format. On the other hand, there are the real music recordings where the original signal is stored on a digital media (e.g. CD, DVD). We developed a state of the art algorithm for alignment of score metadata (MIDI) to its associated real recording. We solved several challenging problems. One problem is the tempo variability between the MIDI file and the real recording. Imagine for example the beetles 'Yesterday'. The MIDI file contains the score information and the basic tempo. One can find several real recordings of this piece. Both by the beetles and by other performers. Our algorithm can do the alignment of the score information to all of these recording. There are wide applications for this development. For example, Query By Melody tool, where the melody is given in form of MIDI and the database contains real recordings.

This software component consists in a layer built on top of the Online Sound Palette databases. It provides high-level, user-oriented services (like: connect to the system, search for audio samples, download, audio metadata authoring, etc.) through a generic Java API. It thus allows any application compliant with Java technology to interact - possibly over the web - with the online Sound Palette databases, just as a human user would do. This layer (on which current applications, among which the online Sound Palette, already rely) is thus the basic component for interoperability of external applications with the current databases. It is foreseen as an important technical basis for technology transfer and partnership in new projects.

The result is a prototype system, which can perform a crawling of web pages that belong to a particular topic area (musical artists, music genres, or any set of items) and which can then compute automatically similarity relations between these sets of items. An important characteristic of the system is its personalization. It can run on a simple PC and accumulate pages in different databases that can then be reused to build similarity measures of different kinds. Specific mechanisms allow the focusing of the crawling (such as the use of search engines such as Google to get meaningful starting points). The system has been designed and implemented with a standard database scheme that can allow scaling of the system to high volumes of data. Specific data structures have been designed to minimize the amount of information actually sored (web page are not stored entirely, only meaningful words). The detection of co-occurrences and computation of similarities has also been optimised. An interface allows the user to specify the crawling (e.g. nb of process running simultaneously), the root pages, as well as the management of databases for storing web pages, the computation of similarities, and their export to a format understandable by the Music Browser.

Music map based on similarity relationships between different regions of the music sequence. Music Map is an innovative way for extracting and representing musical structures of MIDI files. It is intended for advanced listening and browsing through music, for music analysis purposes and for composition. Potential use of the software: - Automatic extraction of the tempo (tempo value + temporal positions of the beat) and the metrics. - Automatic extraction and representation in a similarity matrix of the motivic repetitions and variations. - Selection of a polyphonic motive and search for similar ones in any MIDI database. Softwares that can analyse polyphonic music are very rare. The model we use relies on different features such as pitch intervals, contour and rhythm, which offers multiple viewpoints on the sequence. Special importance is granted to perceptive considerations so that features specific to music are taken into account. The model has been tested on several music corpus such as jazz, classical or contemporary music. It is currently implemented in Ircam's OpenMusic environment and available to Forum Ircam members (approx. 1300 users currently). Implementation in other music software environment such as Finale or Sibelius is foreseen. A stand-alone implementation upon industrial partner of publisher specification can be started if funding is provided.

The Music Browser is a complete software suite aimed at management of collections of songs. It uses a database of songs along with editorial and computed metadata. This software suite allows users to create and edit databases of music titles. More precisely this software suite includes: - An editorial metadata management tool. This tool allows user to edit editorial information on the artists of titles in the database. The Music Browser queries a central database of 20.000 songs and 12.600 artists (this database is referred in this ETip as musical name server - 3971). When a song/artist is recognized, all associated metadata can be automatically imported to the client side. - A tool to compute extractors on the titles of the database. Extractors can be any executable programs, augmented with an information file (.inf) describing it sinput and output types. Additionally, extractors can be produced automatically from the extractor authoring tool - 3970. - A query system enabling to issue complex queries on the titles and artists, - A similarity query system: users can get similar songs according to selected parameters (timbre, genre, energy, tempo) - A powerful playlist generation panel based on constraints: the system generates playlists using rules like: I want 50% of rock songs with an increasing energy. All functions to create and manage database are available through this software suite.

Deliverables

Share this page

Download