Periodic Reporting for period 1 - HI-Audio (Hybrid and Interpretable Deep neural audio machines)
Reporting period: 2022-10-01 to 2025-03-31
understanding and synthesis by a machine. Access to ever-increasing super-computing facilities, combined with the availability of huge (although largely unannotated) data repositories, has led to the emergence of a significant trend towards purely data-driven machine learning approaches. The field has rapidly moved towards end-to-end neural approaches which aim to solve the machine learning problem directly from raw acoustic signals, but often take only loose account of the nature and structure of the processed data. The main consequences are that the models 1) are overly complex, requiring massive amounts of training data and extreme computing power to be efficient (in terms of task performance), and 2) remain largely unexplainable and non-interpretable. To overcome these major shortcomings, we believe that our prior knowledge about the nature of the processed data, their generation process and their perception by humans should be explicitly exploited in neural-based machine learning frameworks.
In HI-Audio, we aim to build radically new and ground-breaking hybrid deep approaches that combine parameter-efficient and interpretable signal models, as well as perceptual, musicological and physics-based models, with highly tailored deep neural architectures. Several breakthrough research directions are proposed in HI-Audio that exploit novel deterministic and statistical audio and sound-environment models with different architectures of neural auto-encoders. To demonstrate the validity and potential of the hybrid models, we will target specific applications including speech and audio scene analysis, music information retrieval, and sound transformation and synthesis.
Objective O1: Hybrid deep learning with differentiable models
• Music synthesis and transformation: We proposed GLA-Grad, a model which integrates a classical iterative signal-processing technique, Griffin-Lim phase retrieval, into the deep learning framework (an illustrative sketch of such an iteration is given after this list). In a related effort, we introduced SpecDiff-GAN, which enhances the performance of an existing GAN model by incorporating a forward diffusion process during training, with a spectrally-shaped adaptive noise injection tailored to prior knowledge of the audio signal. In music transformation, we introduced WaveTransfer, a flexible end-to-end diffusion model that trains a single model to handle multiple musical timbre transfer pairs.
• Dereverberation under diverse supervision paradigms: We developed a deep-learning-based dereverberation method using a novel physical coherence loss, which encourages the room impulse response (RIR) induced by the dereverberated output of the model to match the acoustic properties of the room in which the signal was recorded. We further demonstrated that such hybrid models can be trained in an unsupervised manner, using only reverberant speech and minimal acoustic information.
• Model-based deep learning approaches for source separation: We proposed a novel unsupervised model-based deep learning approach to musical source separation. Each source is modelled with a vocal production model, here implemented as a differentiable parametric source-filter model (see the source-filter sketch after this list). The work has since been extended into a fully differentiable model by integrating a multipitch estimator and a novel differentiable assignment module within the core model.
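As an illustration of the classical iterative technique underlying GLA-Grad, the sketch below shows a plain Griffin-Lim phase-retrieval loop in PyTorch. It is a generic textbook version, not the GLA-Grad implementation itself: the algorithm alternates between the waveform and time-frequency domains while re-imposing the target magnitude, and GLA-Grad combines corrective steps of this kind with diffusion-based waveform generation.

```python
import torch

def griffin_lim(mag, n_fft=1024, hop=256, n_iter=32):
    """Recover a waveform from a magnitude spectrogram by iterating
    ISTFT/STFT while re-imposing the known magnitudes (Griffin-Lim)."""
    window = torch.hann_window(n_fft)
    phase = 2 * torch.pi * torch.rand_like(mag)          # random initial phase
    spec = mag * torch.exp(1j * phase)
    for _ in range(n_iter):
        wav = torch.istft(spec, n_fft, hop_length=hop, window=window)
        spec = torch.stft(wav, n_fft, hop_length=hop, window=window,
                          return_complex=True)
        # keep the estimated phase, re-impose the target magnitude
        spec = mag * torch.exp(1j * torch.angle(spec))
    return torch.istft(spec, n_fft, hop_length=hop, window=window)
```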
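For the source-filter separation model in the last bullet, the following toy sketch (our own simplified illustration, not the published model) shows the principle: a harmonic excitation driven by a fundamental-frequency track is shaped by a learnable FIR filter, and every operation is differentiable, so the parameters can be learned from a reconstruction loss on the mixture.

```python
import torch

def harmonic_excitation(f0, sr=16000, n_harmonics=20):
    """Sum of harmonics of a (possibly time-varying) per-sample f0 track, in Hz."""
    phase = 2 * torch.pi * torch.cumsum(f0, dim=-1) / sr            # instantaneous phase
    k = torch.arange(1, n_harmonics + 1, dtype=torch.float32).unsqueeze(-1)
    return torch.sin(k * phase.unsqueeze(0)).mean(dim=0)            # shape (T,)

class SourceFilter(torch.nn.Module):
    """Toy differentiable source-filter model: excitation convolved with a learnable FIR filter."""
    def __init__(self, filter_len=64):
        super().__init__()
        self.fir = torch.nn.Parameter(0.01 * torch.randn(1, 1, filter_len))

    def forward(self, f0):
        e = harmonic_excitation(f0).view(1, 1, -1)
        return torch.nn.functional.conv1d(e, self.fir, padding=self.fir.shape[-1] // 2)

model = SourceFilter()
f0 = torch.full((16000,), 220.0)   # one second of a constant 220 Hz fundamental
y = model(f0)                      # synthesized source, differentiable w.r.t. the filter
y.pow(2).mean().backward()         # gradients reach the FIR coefficients
```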
Objective O2: Attention-based and model-driven neural architectures:
• Structure-informed Positional Encoding (PE) for Music Generation: We proposed a PE technique that incorporates human knowledge into attention mechanisms, providing the model with insights about the input signals through the multi-resolution structure of music (an illustrative sketch is given after this list).
• Representation learning: We proposed a weakly-supervised framework for cross-modality representation learning in the context of a newly proposed task: melody-lyrics matching (MLM). It integrates neural network encoders with a differentiable alignment algorithm in a contrastive setting, which learns the representations of melody and lyrics and provides their alignment simultaneously. Specific work was also dedicated to discrete representation learning. First, we investigated a radically new approach to quantization that exploits a sequence of random dictionaries, allowing the model to use larger quantization dictionaries with unchanged complexity (see the quantization sketch after this list). Second, we proposed several strategies to obtain disentangled representations, for instance per frequency band (for bandwidth extension) or per source in a mixture recording (for joint audio coding and source separation). We also introduced a novel method to quantify the sensitivity of pre-trained audio embeddings to common audio effects.
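Regarding the structure-informed PE, the exact encoding is specified in the corresponding publication; the snippet below is only a schematic illustration of the general principle, namely giving each token positional information at several musical resolutions (here hypothetical section/bar/beat indices) rather than a single absolute position.

```python
import torch

class StructurePE(torch.nn.Module):
    """Illustrative structure-informed positional encoding: each token receives the sum
    of embeddings of its position at several musical resolutions (section, bar, beat)."""
    def __init__(self, d_model, n_sections=16, n_bars=256, n_beats=16):
        super().__init__()
        self.section = torch.nn.Embedding(n_sections, d_model)
        self.bar = torch.nn.Embedding(n_bars, d_model)
        self.beat = torch.nn.Embedding(n_beats, d_model)

    def forward(self, section_idx, bar_idx, beat_idx):
        # all inputs: integer tensors of shape (batch, seq_len)
        return self.section(section_idx) + self.bar(bar_idx) + self.beat(beat_idx)

# the resulting encoding is added to the token embeddings before the attention layers
```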
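For the random-dictionary quantization, the sketch below illustrates one possible reading of the idea, under our own simplifying assumptions rather than the published algorithm: since each dictionary is regenerated on the fly from a shared seed, encoder and decoder can use a long sequence of large codebooks without storing or learning them, keeping memory and model complexity unchanged.

```python
import torch

def random_codebook(step, codebook_size, dim, seed=0):
    """Regenerate the codebook for a given step from a shared seed, so encoder and
    decoder can use arbitrarily many dictionaries without storing or learning them."""
    g = torch.Generator().manual_seed(seed + step)
    return torch.randn(codebook_size, dim, generator=g)

def quantize(z, step, codebook_size=1024, seed=0):
    """Nearest-neighbour quantization of latent vectors z of shape (n, dim) against the
    random dictionary associated with this step; returns indices and quantized vectors."""
    cb = random_codebook(step, codebook_size, z.shape[-1], seed)
    dist = torch.cdist(z, cb)            # (n, codebook_size) pairwise distances
    idx = dist.argmin(dim=-1)
    return idx, cb[idx]

z = torch.randn(8, 64)                   # a batch of latent vectors
idx, zq = quantize(z, step=3)
```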
Objective O3: Statistical reverberation and musicological models:
• A novel representation for lyrics: For MLM, we need representations that can capture the inherent relationships between melody and lyrics around conceptual pairs such as rhythm and rhyme, note duration and syllabic stress, and structural correspondence. To that end, we introduced sylphone, a novel syllable-level representation for lyrics activated by phoneme identity and vowel stress.
• Reverberation models: We refined several reverberation models to express them efficiently in a differentiable manner, in order to embed them in hybrid neural architectures. For instance, we embedded Polack’s reverberation model in a differentiable hybrid architecture and derived the backpropagation from an RIR synthesized with Polack’s model back to the model’s parameters (see the sketch below).
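Polack’s statistical model describes the late part of an RIR as stationary white Gaussian noise shaped by an exponential decay, h(t) = b(t)·exp(−Δt) with Δ = 3·ln(10)/T60. The toy sketch below (a simplified illustration, not the full hybrid architecture) synthesizes such an RIR in a differentiable way and backpropagates a loss to the reverberation-time parameter.

```python
import math
import torch

def polack_rir(t60, length=8000, sr=16000):
    """Draw an RIR from Polack's model: white Gaussian noise times an exponential
    decay whose rate is set by the reverberation time T60 (in seconds)."""
    t = torch.arange(length, dtype=torch.float32) / sr
    delta = 3.0 * math.log(10.0) / t60      # decay rate: energy drops by 60 dB at t = T60
    return torch.randn(length) * torch.exp(-delta * t)

t60 = torch.tensor(0.5, requires_grad=True)   # learnable reverberation time
rir = polack_rir(t60)
loss = (rir.pow(2).sum() - 1.0).pow(2)        # placeholder loss on the RIR energy
loss.backward()                               # d(loss)/d(T60) is now available in t60.grad
```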
Objective O4: Distributed music crowdsourcing database:
The main goal of this objective is to build the core software tool to gather a large, varied, multi-genre, multi-track, multi-instrument annotated music database suitable for MIR applications. The developed HI-AUDIO online platform relies on a distributed and iterative music recording paradigm to asynchronously record musicians located at different remote sites. The system architecture is based on the client-server paradigm. Specific care was dedicated to data handling and usage, in particular regarding ethics, privacy, and regulations such as the GDPR, as well as intellectual property and copyright. The source code will be publicly available at https://github.com/hi-paris/hiaudio and the application is published at https://hiaudio.fr. The platform has been demonstrated publicly.
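As a purely illustrative sketch of the distributed, iterative recording paradigm (class and field names are hypothetical and do not reflect the actual HI-AUDIO schema), each musician asynchronously submits a take recorded against a shared reference, and the server side groups the takes into multi-track sessions:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Take:
    """One remotely recorded contribution (hypothetical fields, for illustration only)."""
    song_id: str
    musician_id: str
    instrument: str
    audio_path: str        # uploaded audio file on the server
    offset_s: float = 0.0  # alignment offset w.r.t. the shared reference track

def build_sessions(takes):
    """Group asynchronously submitted takes into one multi-track session per song."""
    sessions = defaultdict(list)
    for take in takes:
        sessions[take.song_id].append(take)
    return dict(sessions)

takes = [
    Take("song42", "alice", "drums", "uploads/alice_drums.wav"),
    Take("song42", "bob", "bass", "uploads/bob_bass.wav", offset_s=0.12),
]
print(build_sessions(takes))
```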
Among the most significant results achieved so far, we highlight:
- Structure-informed Positional Encoding (PE) for Music Generation, where we showed that priors informed by musical knowledge (e.g. music structure) can prove beneficial in a scarce-data setting;
- Representation learning, where we obtained excellent results on discrete disentangled representations for audio generation;
- the HI-AUDIO online platform, a novel online tool for building a crowdsourced multi-track music dataset for MIR research purposes.