Raising co-creativity in cyber-human musicianship

Periodic Reporting for period 2 - REACH (Raising co-creativity in cyber-human musicianship)

Reporting period: 2022-07-01 to 2023-12-31

Digital cultures are increasingly pushing forward a deep interweaving between human creativity and the autonomous computational capabilities of surrounding environments, modeling joint human-machine action into new forms of shared reality. Co-creativity between humans and machines will bring about the emergence of distributed information structures, creating new performative situations with mixed artificial and human agents and significantly impacting human development. To this end, the REACH project aims at understanding, modeling, and developing musical co-creativity between humans and machines through improvised interactions, allowing musicians to develop their skills and expand their individual and social creative potential. Indeed, improvisation is at the very heart of all human interactions, and music is a fertile ground for developing models and tools of creativity that can be generalized to human social activity. REACH studies shared musicianship occurring at the intersection of the physical, human and digital spheres as an archetype of distributed intelligence, and produces models and tools as vehicles to better understand and foster music creativity that is increasingly intertwined with computation. REACH is based on the hypothesis that co-creativity in cyber-human systems results from the emergence of coherent behaviors and structure formation driven by cross-learning and information transfer between agents, as is inherent to complex systems. It crosses approaches from AI, musical informatics, cognitive and social sciences, and mixed reality.
We have made a number of advances since the beginning of REACH in setting up a theoretical and practical machine learning framework that serves our general objective of combining generative models and interaction, effectively feeding our "Deep Structure Discovery" work package.
This allowed us to present novel deep-learning models, based on the well-known "Transformer" technology from Google Brain and on "contrastive learning", an approach that lets a system learn in a self-supervised way. We could then combine several modalities of text, audio and music in the learned representation, so that it became possible to generate high-quality music samples from users' text descriptions, thanks to a new diffusion model, MusicLDM. Diffusion is a method by which music samples are progressively refined from a noisy representation towards a meaningful distribution while integrating the language constraints. Transformers were successfully used with audio through latent diffusion (LDM), and with polyphonic symbolic data such as multitrack MIDI (the representation of instrumental actions). These techniques have been put to the test on stage in public concerts, e.g. during the Improtech music festival (improtech.ircam.fr).
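To illustrate the principle (and only the principle) behind such text-conditioned generation, the sketch below runs a toy reverse-diffusion loop in which a latent vector is denoised step by step under the influence of a text embedding. The TinyDenoiser network, the dimensions and the noise schedule are hypothetical stand-ins, not the MusicLDM implementation.

```python
# Minimal sketch of text-conditioned latent diffusion: a latent is progressively
# denoised from pure noise while a text embedding conditions every step.
# The tiny network and schedule below are illustrative, not the REACH models.
import torch
import torch.nn as nn

T = 50                                   # number of diffusion steps (toy value)
betas = torch.linspace(1e-4, 0.02, T)    # linear noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

class TinyDenoiser(nn.Module):
    """Predicts the noise present in a latent, conditioned on a text embedding."""
    def __init__(self, latent_dim=64, text_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + text_dim + 1, 128), nn.SiLU(),
            nn.Linear(128, latent_dim),
        )
    def forward(self, z, text_emb, t):
        t_feat = torch.full((z.shape[0], 1), float(t) / T)   # timestep feature
        return self.net(torch.cat([z, text_emb, t_feat], dim=-1))

@torch.no_grad()
def sample(model, text_emb, latent_dim=64):
    z = torch.randn(text_emb.shape[0], latent_dim)            # start from pure noise
    for t in reversed(range(T)):
        eps = model(z, text_emb, t)                           # predicted noise
        a, ab = alphas[t], alpha_bars[t]
        z = (z - (1 - a) / torch.sqrt(1 - ab) * eps) / torch.sqrt(a)
        if t > 0:
            z = z + torch.sqrt(betas[t]) * torch.randn_like(z)
    return z   # denoised latent, decoded to audio by a separate decoder in practice

model = TinyDenoiser()
text_embedding = torch.randn(1, 32)       # stand-in for a language embedding
latent = sample(model, text_embedding)
```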
We also developed novel technologies: a zero-shot audio source separation model that extracts sources from a sound mixture, singing melody extraction that pays attention to pitch, register and harmony content, and sound localization based on auditory perception. Zero-shot separation allows us to work on instruments never encountered in training, and localization can help to separate voices. We are, to the best of our knowledge, the first to strongly relate these "machine listening" problems to the co-creativity question, as in one of our major achievements: the recreation of a piece recorded long ago by the famous jazzmen Toots Thielemans and Bill Evans, albeit with a previously unheard solo "by Toots".
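As a minimal sketch of what query-conditioned ("zero-shot") separation means in practice, the toy model below predicts a soft spectrogram mask for a mixture, conditioned on an embedding describing the requested source; because the target is specified by an embedding rather than a fixed output class, instruments unseen during training can still be requested. The QueryConditionedSeparator class and all dimensions are illustrative assumptions, not the REACH model.

```python
# Toy query-conditioned separation: predict a soft mask over a mixture
# spectrogram given an embedding of the requested source. Purely illustrative.
import torch
import torch.nn as nn

class QueryConditionedSeparator(nn.Module):
    def __init__(self, n_freq=513, query_dim=32, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_freq + query_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_freq), nn.Sigmoid(),   # per-bin soft mask in [0, 1]
        )
    def forward(self, mix_spec, query_emb):
        # mix_spec: (batch, time, n_freq) magnitude spectrogram
        # query_emb: (batch, query_dim) embedding of the requested source
        q = query_emb.unsqueeze(1).expand(-1, mix_spec.shape[1], -1)
        mask = self.net(torch.cat([mix_spec, q], dim=-1))
        return mask * mix_spec                          # estimated source magnitude

separator = QueryConditionedSeparator()
mixture = torch.rand(1, 100, 513)          # 100 frames of a magnitude spectrogram
harmonica_query = torch.randn(1, 32)       # stand-in embedding for "harmonica"
estimated = separator(mixture, harmonica_query)
```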
At this stage of the REACH project, we are trying to make coherent sense of the different stages of listening, training, generation and interactive experience that we experiment with. Following Prof. Shlomo Dubnov's earlier work on Music Information Dynamics (MID), we have proposed a novel DMID framework (Deep Music Information Dynamics) that combines the quality of latent representations, as learned by deep AI frameworks, on one side, with accurate prediction of changes in the distribution of musical information over time on the other side, into a unified theoretical framework that explains mathematically the information transfers between agents (Symmetric Transfer Entropy).
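For readers unfamiliar with transfer entropy, the sketch below estimates it from two discretized musical streams (e.g. quantized pitch symbols) with a simple histogram method, and symmetrizes it by summing both directions. This is a generic illustration of the quantity; the exact SymTE formulation and estimators used in REACH may differ.

```python
# Histogram-based estimator of transfer entropy between two symbol sequences,
# symmetrized by summing both directions. Generic illustration only.
import numpy as np
from collections import Counter

def transfer_entropy(x, y):
    """TE(X -> Y) in bits, with history length 1, from two equal-length sequences."""
    triples = list(zip(y[1:], y[:-1], x[:-1]))           # (y_next, y_prev, x_prev)
    n = len(triples)
    p_xyz = Counter(triples)
    p_yy = Counter(zip(y[1:], y[:-1]))
    p_yx = Counter(zip(y[:-1], x[:-1]))
    p_y = Counter(y[:-1])
    te = 0.0
    for (y1, y0, x0), c in p_xyz.items():
        p_joint = c / n
        p_cond_full = c / p_yx[(y0, x0)]                  # p(y_next | y_prev, x_prev)
        p_cond_self = p_yy[(y1, y0)] / p_y[y0]            # p(y_next | y_prev)
        te += p_joint * np.log2(p_cond_full / p_cond_self)
    return te

def symmetric_transfer_entropy(x, y):
    return transfer_entropy(x, y) + transfer_entropy(y, x)

# Example: two toy streams of quantized musical symbols, b loosely following a
rng = np.random.default_rng(0)
a = rng.integers(0, 4, 500).tolist()
b = [(v + rng.integers(0, 2)) % 4 for v in a]
print(symmetric_transfer_entropy(a, b))
```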
These fundamental advances have laid solid foundations for machine learning of improvisation models and interaction policies between musical agents, one of the major objectives of REACH.

On the engineering side, we have had a sustained development activity aimed at upgrading co-creative tools and creating new ones. The REACH software ecosystem now comprises a family of concrete computational environments which address different aspects of the musical (improvisatory) mind: Djazz includes scenario and beat structure, Somax2 is a reactive program that adapts continuously to the musician's changes, and DYCI2 is a combination of reactivity and micro-scenarios. These environments use the popular Max/MSP platform as a front-end for interaction and a Python server as a back-end for the AI algorithms. A visual component is also being developed in order to extend the co-creative capacities to image creation and animation in a synchronized way.

As an example, one of our flagship tools, Somax2, is a highly original environment structured around five main "skills": a latent space, built once and for all by machine learning algorithms trained on a large musical dataset, encoding general harmonic and textural knowledge; a real-time machine listening device able to segment, analyse and encode musical streams into discrete components matched against the latent space; a discrete sequential learning model able to figure out the pattern organisation of musical streams and form a state-based memory structure; a cognitive memory model able to temporally evolve continuous envelopes over the sequential state structure, representing the activation rate (hot spots) with regard to ever-shifting internal and external influences and viewpoints; and a set of interaction policies determining how and when to react to influences. Somax2 and its siblings have gained significant momentum, being "played" in major festivals and concert venues in live interaction with world-class musicians (examples include the Ircam Manifeste Festival concert at the Centre Pompidou with contemporary bassist and lifetime achievement award winner Joëlle Léandre and the rock band Horse Lords, in June 2023, and the Improtech International Workshop-Festival in Uzeste, France, in August 2023).
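The schematic sketch below suggests how these five skills could articulate in a Python back-end agent of this kind; it is not the Somax2 codebase, and the ImprovisationAgent class, the toy dot-product matching and all dimensions are illustrative assumptions (the OSC link to the Max/MSP front-end is omitted).

```python
# Schematic sketch (not the actual Somax2 code) of a back-end agent articulating
# the five "skills" described above. All names and dimensions are illustrative.
import numpy as np

class ImprovisationAgent:
    def __init__(self, latent_proj, memory_vectors, memory_events):
        self.latent_proj = latent_proj          # skill 1: fixed, pretrained projection
        self.memory_vectors = memory_vectors    # skill 3: encoded corpus states...
        self.memory_events = memory_events      #          ...and the events they index
        self.activation = np.zeros(len(memory_events))   # skill 4: activation envelope

    def listen(self, feature_frame):
        """Skill 2: encode an incoming listening frame into the latent space."""
        return self.latent_proj @ feature_frame

    def influence(self, encoded, decay=0.9):
        """Skill 4: raise activation of memory states resembling the influence."""
        similarity = self.memory_vectors @ encoded
        self.activation = decay * self.activation + similarity

    def decide(self, threshold=1.0):
        """Skill 5: a minimal interaction policy, reacting only above a threshold."""
        peak = int(np.argmax(self.activation))
        if self.activation[peak] > threshold:
            return self.memory_events[peak]     # event to send back to the front-end
        return None                             # otherwise keep listening

# Toy usage: 8-dimensional listening features projected into a 4-dimensional latent space
rng = np.random.default_rng(1)
agent = ImprovisationAgent(latent_proj=rng.standard_normal((4, 8)),
                           memory_vectors=rng.standard_normal((100, 4)),
                           memory_events=[f"slice_{i}" for i in range(100)])
encoded = agent.listen(rng.standard_normal(8))
agent.influence(encoded)
print(agent.decide())
```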
We have collaborated with the company HyVibe to develop a mixed-reality extension of Somax2 based on their "smart" acoustic guitar (over 10,000 units sold). We prototyped a simplified version of the co-improvisation system on a Raspberry Pi board in order to test it both as an embedded mixed-reality extension of the instrument itself and as an external guitar pedal, which we coined the "creative looper". These devices "augment" the guitarist's live playing by continuously layering improvised loops, chords, or solos generated from the material they have been trained on, thus providing a unique creative improvisation and composition tool.
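A minimal sketch of the layering behaviour of such a "creative looper" is given below: a fixed-length loop buffer into which machine-generated phrases are overdubbed on top of the live signal. Real-time audio I/O and the generative model itself are abstracted away; the CreativeLooper class and its parameters are illustrative assumptions.

```python
# Minimal sketch of the "creative looper" layering behaviour: generated phrases
# are overdubbed into a circular loop buffer and mixed with the live signal.
import numpy as np

class CreativeLooper:
    def __init__(self, loop_seconds=8, sample_rate=44100):
        self.loop = np.zeros(loop_seconds * sample_rate)   # circular loop buffer
        self.pos = 0

    def overdub(self, generated_phrase, gain=0.8):
        """Layer a machine-generated phrase into the loop at the current position."""
        idx = (self.pos + np.arange(len(generated_phrase))) % len(self.loop)
        self.loop[idx] += gain * generated_phrase

    def process_block(self, live_block):
        """Mix the loop with the live signal and advance the playhead."""
        idx = (self.pos + np.arange(len(live_block))) % len(self.loop)
        out = live_block + self.loop[idx]
        self.pos = (self.pos + len(live_block)) % len(self.loop)
        return out

looper = CreativeLooper()
phrase = np.sin(2 * np.pi * 220 * np.arange(44100) / 44100)   # stand-in generated phrase
looper.overdub(phrase)
output = looper.process_block(np.zeros(512))                  # one audio block
```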
Finally, on the social science side, experiments were carried out on the notion of acceptability applied to musical avatars computed by AI. In particular, we built an avatar of the great Belgian harmonica player Toots Thielemans by extracting his solos from the record "Affinity" with the Bill Evans trio (1979) using our deep zero-shot extraction method, and submitted them as training data to Djazz, which was then asked to regenerate new improvisations that were mixed back with the accompaniment. The avatar was presented at the Royal Library in Brussels on the occasion of Toots' centenary and drew great attention among experts. Other important doctoral studies have been launched in order to examine how co-creative software extensions can be used over popular social networks (such as TikTok) to communicate with other musicians, recruit them into shared experiences, and then form new communities. Experimental studies have also been launched in order to measure how much of the initial embodiment in the musical training data is preserved in the computational avatars, and under what conditions crucial properties of embodiment can be transmitted through the learning process when the training corpus is itself produced by embodied agents.
Overall, the REACH ecosystem has already produced a body of theoretical knowledge in deep structure discovery, a collection of practical co-creative tools already used in large real-life artistic applications, and a human science research environment fostering anthropological, cognitive and social advances.
REACH achievements have brought novel methodologies, recognized by the research community, and have pushed beyond the state of the art.
For example, our HTS-AT hierarchical audio transformer model, which produces the combined latent representation of music and general audio content, is regarded as one of the state-of-the-art audio classification models on multiple benchmarks, including AudioSet, ESC-50, and Speech Commands V2. It has been widely used in many follow-up works, including multi-modality learning, sound event detection, audio source separation, etc. Our CLAP contrastive language-audio pretraining model received a lot of attention from the music and audio community because of its high performance on text-to-audio retrieval tasks and its high generalization ability on different downstream tasks. Our zero-shot audio source separation model opened a new direction for the conditional audio source separation task: it achieves separation performance competitive with the state of the art on music source separation while being trained entirely without isolated source data. Our transfer entropy method, SymTE, gives a quantitative score of the appropriateness of musical co-improvisation, and is a first step towards solving the improvisation influence problem (how an improvising agent's signal is computed from another, co-improvising agent's signal). Our results indicate that SymTE beats other baselines, such as distance-based methods, in choosing the right model for a given musical context. An interdisciplinary development related to SymTE is now under way in order to use it as a criterion for robot empowerment or intrinsic motivation, as part of a larger conceptual and philosophical framework of the action-perception loop, where the standard reinforcement learning framework is extended by weighing future expected rewards against processing effort as a basic behavioral principle in decision and action sequences. Such decisions are fundamentally important in improvisation settings, where musicians trade the precision of the momentary musical sound against the flow of musical form and the co-creation of musical discourse, and they will bring considerable improvements to the REACH co-creative ecosystem.
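As background, contrastive language-audio pretraining of the CLAP kind optimizes a symmetric cross-entropy objective that pulls matching audio/text embedding pairs together and pushes mismatched pairs apart; the toy computation below writes this objective out, with random tensors standing in for the outputs of real audio and text encoders (a sketch of the general recipe, not the CLAP training code).

```python
# Toy illustration of the symmetric contrastive objective used in language-audio
# pretraining: the i-th audio clip should match the i-th caption in a batch.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    a = F.normalize(audio_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = a @ t.T / temperature                 # pairwise similarities
    targets = torch.arange(len(a))                 # matching pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))

audio_embeddings = torch.randn(16, 128)            # stand-in audio encoder outputs
text_embeddings = torch.randn(16, 128)             # stand-in text encoder outputs
loss = contrastive_loss(audio_embeddings, text_embeddings)
```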