Periodic Reporting for period 4 - EVERYSOUND (Computational Analysis of Everyday Soundscapes)
Reporting period: 2019-11-01 to 2020-04-30
Computational analysis of everyday soundscapes will open up vast possibilities to utilize the information encoded in sound. Automatically acquired high-level descriptions of audio will enable the development of content-based multimedia query methods, which are not feasible with today's technology. Robots, smartphones, and other devices able to recognize sounds will become aware of physical events in their surroundings. Intelligent monitoring systems will be able to automatically detect events of interest (danger, crime, etc.) and classify different sources of noise. Automatic sound recognition will allow new assistive technologies for the hearing impaired, for example by visualizing sounds. Producing descriptions of audio will give new tools for geographical, social, and cultural studies to analyze human and non-human activity in urban and rural areas. Acquiring descriptions of animal sounds provides tools to analyze the biodiversity of an area cost-efficiently.
The main goal of EVERYSOUND is to develop computational methods for automatically producing high-level descriptions of general sounds encountered in everyday soundscapes. The specific objectives of EVERYSOUND, corresponding to the components that contribute to the main goal of the project and the overall framework, are:
O1: Production of a large-scale corpus consisting of audio material and reference descriptions from a large number of everyday contexts, for developing and benchmarking computational everyday soundscape analysis methods.
O2: Development of a taxonomy for sounds in everyday environments.
O3: Development of robust pattern recognition algorithms that allow recognition of multiple co-occurring sounds that may have been distorted by reverberation and channel effects.
O4: Development of contextual models for everyday soundscapes that will take into account relationships between multiple sources, their acoustic characteristics, and the context.
The project produced several datasets for the development of the methods, as well as computational methods based on deep neural networks for sound event detection. It attracted a large number of researchers to work on the topic by producing public datasets and evaluation benchmarks.
WP1 focused on collecting data for the development and evaluation of methods. Audio data was produced by making real audio recordings, simulating data, acquiring data from existing sound libraries, and combining real and simulated data. Audio annotations were produced by trained annotators and by crowdsourcing.
WP2 developed a multilayer taxonomy for sound events. Sound events were annotated such that each label describes the sound as closely as possible in terms of sound production, using a noun to indicate the object or being that produces the sound and a verb to indicate the action that produces it.
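The noun-verb labeling convention can be illustrated with a trivial label parser. The underscore-separated label format used here is an assumption for illustration, not necessarily the taxonomy's actual notation:

```python
def parse_label(label):
    """Split a hypothetical 'noun_verb' sound event label into the sound
    source (noun) and the sound-producing action (verb)."""
    noun, verb = label.split("_", 1)
    return noun, verb

print(parse_label("car_passing"))   # ('car', 'passing')
print(parse_label("dog_barking"))   # ('dog', 'barking')
```

Keeping source and action separate in the label allows grouping events either by the object producing the sound or by the type of action.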
WP3 developed pattern recognition methods for detecting a large number of overlapping sound events in realistic soundscapes. We developed a Convolutional Recurrent Neural Network (CRNN) based method for sound event detection. We obtained state-of-the-art results with this method on several datasets. We also developed robust training procedures for heterogeneous data sources by investigating various feature normalization techniques to deal with mismatches between datasets. We also investigated leveraging information from existing models via transfer learning.
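The CRNN idea can be sketched in a few lines of numpy: a convolutional stage extracts local spectro-temporal patterns, a recurrent layer integrates temporal context, and frame-wise sigmoid outputs give multi-label event activities. All layer sizes and (random) weights below are purely illustrative, not the project's actual model:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def crnn_forward(spectrogram, conv_w, rnn_w, rnn_u, out_w, kernel=3):
    """Minimal CRNN sketch: a 1-D convolution over time, a vanilla RNN,
    and per-frame sigmoid outputs giving multi-label event activity.
    spectrogram: (T, F) array of e.g. log-mel features."""
    T, F = spectrogram.shape
    H = conv_w.shape[0]                       # number of conv filters
    # Convolution over time (valid padding): each filter spans `kernel` frames.
    conv = np.array([
        np.tanh(conv_w @ spectrogram[t:t + kernel].ravel())
        for t in range(T - kernel + 1)
    ])                                        # (T', H)
    # Recurrent layer aggregates temporal context frame by frame.
    h = np.zeros(H)
    states = []
    for x in conv:
        h = np.tanh(rnn_w @ x + rnn_u @ h)
        states.append(h)
    states = np.array(states)                 # (T', H)
    # Frame-wise sigmoid output: one activity probability per event class,
    # so overlapping events can be active simultaneously.
    return sigmoid(states @ out_w)            # (T', n_classes)

rng = np.random.default_rng(0)
T, F, H, C, K = 20, 40, 8, 5, 3
probs = crnn_forward(
    rng.normal(size=(T, F)),
    conv_w=rng.normal(scale=0.1, size=(H, K * F)),
    rnn_w=rng.normal(scale=0.1, size=(H, H)),
    rnn_u=rng.normal(scale=0.1, size=(H, H)),
    out_w=rng.normal(scale=0.1, size=(H, C)),
)
print(probs.shape)  # (18, 5): frame-wise probabilities for 5 event classes
```

The multi-label sigmoid output (rather than a softmax) is what lets the detector report several co-occurring events per frame.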
WP4 developed contextual models for everyday sounds. The main focus of the WP was on language models for sound event detection, which calculate the probability of a sound event occurring given the previous events. The WP also developed novel techniques that estimate the spatial location of a sound event, in addition to its temporal activity and class label.
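A minimal count-based bigram model illustrates the language-model idea for event sequences; the event vocabulary and training sequences below are invented for illustration and do not reproduce the project's actual models:

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count-based bigram model over sound event sequences:
    estimates P(event | previous event) from relative frequencies."""
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(["<s>"] + seq, seq):   # "<s>" marks sequence start
            counts[prev][nxt] += 1
    return {prev: {ev: c / sum(ctr.values()) for ev, c in ctr.items()}
            for prev, ctr in counts.items()}

# Hypothetical annotated event sequences from street recordings.
sequences = [
    ["car", "brakes", "door_slam"],
    ["car", "brakes", "horn"],
    ["footsteps", "door_slam"],
]
model = train_bigram(sequences)
print(model["brakes"])  # {'door_slam': 0.5, 'horn': 0.5}
```

Such probabilities can then rescore an acoustic detector's hypotheses, favoring event sequences that are plausible in context.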
Project results were disseminated through more than 50 peer-reviewed scientific papers published in international journals and conferences, as well as in conference presentations. A tutorial and keynote talks about the project's methods were also given. The results were further disseminated in a national television broadcast and in a newspaper article.
- Development of convolutional recurrent neural networks for the detection of a large number of sound events in realistic environments
- Taxonomy of sound events based on noun and verb pairs
- Several datasets for sound event detection, acoustic scene classification, joint localization and detection of sound events, and audio captioning
Additionally, we addressed several novel problems:
1) Characterizing soundscapes in terms of textual captions, which are complete sentences describing their contents.
2) Use of multichannel audio for joint detection and localization of sound events. The use of multichannel audio can lead to improved sound event detection accuracy in comparison to single-channel techniques, as well as provide location information. We also investigated tracking of sound events as they move in space.
3) Interactive methods for annotating sound event databases for active learning of sound event models. Interactive methods allow faster annotation of sound event databases, which reduces the cost of database collection. We developed novel active learning methods that first analyze an initially unlabeled audio dataset and select sound segments from it for manual annotation.
4) Zero-shot learning of acoustic models by using textual labels of the classes to predict the model. Using class labels enables learning acoustic models without any audio data collection or annotation, which makes it possible to address the recognition of new types of events efficiently.
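For item 2), a standard way to extract location cues from multichannel audio is time-difference-of-arrival estimation between microphone pairs. The GCC-PHAT sketch below illustrates the principle on a synthetic two-microphone signal; it is a textbook technique, not necessarily the localization method used in the project:

```python
import numpy as np

def gcc_phat(sig, ref, fs):
    """Estimate the time-difference-of-arrival (TDOA) between two microphone
    signals with GCC-PHAT: cross-spectrum whitened to phase only, then the
    peak of the resulting cross-correlation gives the lag in samples."""
    n = len(sig) + len(ref)
    X = np.fft.rfft(sig, n=n) * np.conj(np.fft.rfft(ref, n=n))
    cc = np.fft.irfft(X / (np.abs(X) + 1e-12), n=n)
    max_shift = n // 2
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))  # lags -max..+max
    return (np.argmax(np.abs(cc)) - max_shift) / fs             # TDOA in seconds

# Synthetic example: the second microphone receives the source 5 samples later.
rng = np.random.default_rng(0)
fs, delay = 8000, 5
src = rng.normal(size=800)                        # broadband source signal
mic1 = src
mic2 = np.concatenate((np.zeros(delay), src[:-delay]))
tdoa = gcc_phat(mic2, mic1, fs)                   # ≈ delay / fs seconds
```

Given TDOAs for several microphone pairs and the array geometry, the direction of arrival of each detected event can then be triangulated.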
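For item 3), the core of an uncertainty-based selection step in an active learning loop can be sketched as follows; the criterion (distance of the predicted probability from 0.5) is a common baseline, not necessarily the project's exact method:

```python
import numpy as np

def select_for_annotation(probs, budget):
    """Uncertainty-sampling sketch: pick the `budget` unlabeled segments
    whose predicted event probabilities are closest to 0.5, i.e. where
    the current model is least confident."""
    uncertainty = -np.abs(probs - 0.5)        # higher = less confident
    return np.argsort(uncertainty)[::-1][:budget]

# Hypothetical model predictions for five unlabeled segments.
probs = np.array([0.95, 0.48, 0.10, 0.55, 0.99])
print(select_for_annotation(probs, 2))        # indices with probs nearest 0.5
```

In a full loop, the selected segments are annotated manually, added to the training set, and the model is retrained before the next selection round.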
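For item 4), zero-shot learning can be sketched as a learned mapping from a textual label embedding to classifier weights, so a new class needs no training audio. The linear map, dimensions, and embeddings below are hypothetical placeholders:

```python
import numpy as np

def zero_shot_classifier(label_embedding, mapping):
    """Zero-shot sketch: a mapping (trained on classes that do have audio)
    turns a textual label embedding into acoustic classifier weights for
    a class with no audio examples at all."""
    return mapping @ label_embedding          # weights for the unseen class

rng = np.random.default_rng(1)
D_text, D_audio = 16, 32
mapping = rng.normal(size=(D_audio, D_text))  # stand-in for a learned map
new_label_emb = rng.normal(size=D_text)       # embedding of an unseen label
w = zero_shot_classifier(new_label_emb, mapping)

audio_feature = rng.normal(size=D_audio)      # features of a test clip
score = float(audio_feature @ w)              # class score, no training audio used
```

The key point is that only the mapping is trained; recognizing a new event type then reduces to embedding its textual label.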