Periodic Reporting for period 3 - REACH (Raising co-creativity in cyber-human musicianship)
Reporting period: 2024-01-01 to 2025-06-30
This allowed us to present novel deep-learning models based on the well-known “Transformer” architecture from Google Brain and on “Contrastive Learning”, which lets a system learn in a self-supervised way. We could then combine the text, audio and music modalities in a single learned representation, making it possible to generate high-quality music samples from users’ text descriptions thanks to a new diffusion model, MusicLDM. These techniques have been put to the test on stage in public concerts, e.g. during the Improtech music festival (improtech.ircam.fr).
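The pairing of audio and text in a shared representation rests on a symmetric contrastive objective. The following is a minimal sketch of that idea, assuming generic placeholder embeddings rather than the project’s actual HTS-AT and text encoders; names and dimensions are illustrative only.

```python
# Minimal sketch of contrastive language-audio pretraining (CLAP-style).
# The random embeddings below stand in for outputs of an audio encoder
# (e.g. HTS-AT) and a text encoder; they are placeholders, not real models.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired (audio, text) embeddings."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0))            # matching pairs lie on the diagonal
    loss_a2t = F.cross_entropy(logits, targets)       # audio -> text direction
    loss_t2a = F.cross_entropy(logits.t(), targets)   # text -> audio direction
    return (loss_a2t + loss_t2a) / 2

# Toy usage with random vectors standing in for encoder outputs.
audio_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)
print(contrastive_loss(audio_emb, text_emb).item())
```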
At this stage of the REACH project, we try to make coherent sense of the different stages of listening, training, generation and interactive experience. Following Prof. Shlomo Dubnov’s earlier work on Music Information Dynamics (MID), we have proposed a novel Deep Music Information Dynamics (DMID) framework that combines the quality of the latent representations learned by deep AI models on one side with accurate prediction of changes in the musical information distribution over time on the other, into a unified theoretical framework that mathematically describes the information transfers between agents (Symmetric Transfer Entropy).
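To make the information-transfer idea concrete, the sketch below estimates transfer entropy between two discrete symbol streams with a history length of 1, and sums the two directions as one possible symmetrisation. This is an illustrative toy estimator, not the project’s SymTE implementation, and the exact symmetrisation used in DMID may differ.

```python
# Illustrative estimator of transfer entropy between two aligned discrete
# symbol streams, the quantity underlying the DMID / SymTE framework.
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """Estimate T(X -> Y) in bits from two aligned discrete sequences (history = 1)."""
    triples = Counter(zip(y[1:], y[:-1], x[:-1]))   # counts of (y_next, y_prev, x_prev)
    pairs_yy = Counter(zip(y[1:], y[:-1]))          # counts of (y_next, y_prev)
    pairs_yx = Counter(zip(y[:-1], x[:-1]))         # counts of (y_prev, x_prev)
    singles_y = Counter(y[:-1])                     # counts of y_prev
    n = len(y) - 1
    te = 0.0
    for (y_next, y_prev, x_prev), c in triples.items():
        p_joint = c / n
        p_cond_full = c / pairs_yx[(y_prev, x_prev)]              # p(y_next | y_prev, x_prev)
        p_cond_self = pairs_yy[(y_next, y_prev)] / singles_y[y_prev]  # p(y_next | y_prev)
        te += p_joint * np.log2(p_cond_full / p_cond_self)
    return te

def symmetric_transfer_entropy(x, y):
    """One possible symmetrisation: the sum of both directional transfers."""
    return transfer_entropy(x, y) + transfer_entropy(y, x)

# Toy usage on two short symbol streams (e.g. quantised musical features).
x = [0, 1, 0, 1, 1, 0, 1, 0, 0, 1]
y = [1, 0, 1, 0, 1, 1, 0, 1, 0, 0]
print(symmetric_transfer_entropy(x, y))
```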
On the engineering side, we have pursued a sustained development activity aimed at upgrading co-creative tools and creating new ones. The REACH software ecosystem now comprises a family of concrete computational environments addressing different aspects of the musical (improvisatory) mind: Djazz handles scenario and beat structure, Somax2 is a reactive program that adapts continuously to the musician’s changes, and DYCI2 combines reactivity with micro-scenarios. A visual component is also being developed to extend the co-creative capacities to image creation and animation in a synchronized way. As an example, one of our flagship tools, Somax2, is a highly original environment structured around five main “skills”: a latent space built once and for all by machine-learning algorithms trained on a large musical data set, encoding general harmonic and textural knowledge; a real-time machine-listening device able to segment, analyse and encode musical streams into discrete components matched against the latent space; a discrete sequential learning model able to figure out the pattern organisation of musical streams and form a state-based memory structure; a cognitive memory model able to evolve continuous envelopes over the sequential state structure in time, representing the activation rate (hot spots) with regard to ever-shifting internal and external influences and viewpoints; and a set of interaction policies determining how and when to react to influences.
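The interplay between the state-based memory and the activation envelopes can be illustrated with a toy model: memory states matching an incoming influence receive activation that decays over time, and the most activated region is the “hot spot” guiding generation. The class and parameter names below are purely illustrative and are not Somax2’s actual API.

```python
# Highly simplified sketch of an activation-envelope memory in the spirit of
# Somax2's cognitive memory model: influences boost matching states, the
# envelope decays over time, and the peak indicates the current hot spot.
import numpy as np

class ActivationMemory:
    def __init__(self, labels, decay=0.9):
        self.labels = np.array(labels)          # label of each memory state (e.g. chord or pitch class)
        self.activation = np.zeros(len(labels)) # continuous activation envelope over the states
        self.decay = decay                      # exponential decay applied at each time step

    def influence(self, label, weight=1.0):
        """Decay the envelope, then inject activation at states matching the influence."""
        self.activation *= self.decay
        self.activation[self.labels == label] += weight

    def peak(self):
        """Index of the currently most activated state (the 'hot spot')."""
        return int(np.argmax(self.activation))

# Toy usage: a memory of chord labels receiving a stream of listened influences.
memory = ActivationMemory(labels=["C", "F", "G", "C", "Am", "F", "G", "C"])
for heard in ["G", "C", "C"]:
    memory.influence(heard)
print(memory.peak())   # state most strongly suggested for the next generation step
```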
Finally, on the social science side, experiments were carried out on the notion of acceptability applied to musical avatars computed by AI. In particular, we built an avatar of the great Belgian harmonica player Toots Thielemans by extracting his solos from the record "Affinity" with the Bill Evans trio (1979) using our deep zero-shot extraction method, and submitted them as training data to Djazz, which was then asked to generate new improvisations that were mixed back with the accompaniment. The avatar was presented at the Royal Library in Brussels on the occasion of Toots' centenary and drew great attention among experts. Other important doctoral studies have been launched to examine how co-creative software extensions can be used over popular social networks (such as TikTok) to communicate with other musicians, recruit them into shared experiences, and then form new communities.
Overall, the REACH ecosystem has already produced a body of theoretical knowledge in deep structure discovery, a collection of practical co-creative tools already used in large real-life artistic applications, and a human-science research environment fostering anthropological, cognitive and social advances.
For example, our HTS-AT hierarchical audio transformer model, used to produce the combined latent representation of music and general audio content, is recognized as one of the state-of-the-art audio classification models on more than three benchmarks, including AudioSet, ESC-50 and SpeechCommand V2. It has been widely reused in subsequent work on multi-modal learning, sound event detection, audio source separation and more. Our CLAP (contrastive language-audio pretraining) model has received a lot of attention from the music and audio community because of its high performance on text-to-audio retrieval and its strong generalization to downstream tasks. Our zero-shot audio source separation model opens a new direction within conditional audio source separation: it achieves performance competitive with the state of the art on music source separation while being trained entirely without isolated source data. Our transfer entropy method, SymTE, gives a quantitative score of the appropriateness of musical co-improvisation and is a first step towards solving the improvisation influence problem (how an improvising agent’s signal is computed from another, co-improvising agent’s signal). Such decisions are fundamentally important in improvisation settings, where musicians trade the precision of the momentary musical sound against the flow of musical form and the co-creation of musical discourse, and they will bring considerable improvements to the REACH co-creative ecosystem.
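As an illustration of the text-to-audio retrieval task on which CLAP is evaluated, the sketch below ranks a collection of audio clips against a text query by cosine similarity in a shared embedding space. The random vectors stand in for CLAP encoder outputs and are not real embeddings; function names are illustrative only.

```python
# Sketch of text-to-audio retrieval in a shared (CLAP-style) embedding space:
# rank audio clips by cosine similarity to a text query embedding.
import numpy as np

def rank_audio_by_text(text_emb, audio_embs):
    """Return clip indices sorted from most to least similar to the text query."""
    text_emb = text_emb / np.linalg.norm(text_emb)
    audio_embs = audio_embs / np.linalg.norm(audio_embs, axis=1, keepdims=True)
    similarities = audio_embs @ text_emb              # cosine similarity per clip
    return np.argsort(-similarities), similarities

# Toy usage: placeholder embeddings for one query and a 100-clip collection.
text_emb = np.random.randn(512)                       # e.g. "solo jazz harmonica"
audio_embs = np.random.randn(100, 512)
order, sims = rank_audio_by_text(text_emb, audio_embs)
print(order[:5], sims[order[:5]])                     # top-5 retrieved clips and their scores
```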