
Computational Analysis of Everyday Soundscapes

Periodic Reporting for period 4 - EVERYSOUND (Computational Analysis of Everyday Soundscapes)

Reporting period: 2019-11-01 to 2020-04-30

Sounds carry a large amount of information about our everyday environments and the physical events that take place in them. For example, when a car passes by, one can perceive its approximate size and speed. Similar information can be obtained from many other sound sources, such as humans and animals. Sound can be captured easily and non-intrusively by cheap recording devices and transmitted further – for example, tens of hours of audio are uploaded to the internet every minute, e.g. in the form of YouTube videos. Extracting information from everyday sounds is easy for humans, but today's computational audio analysis algorithms are not able to recognize the individual sounds within everyday soundscapes.
Computational analysis of everyday soundscapes will open up vast possibilities to utilize the information encoded in sound. Automatically acquired high-level descriptions of audio will enable the development of content-based multimedia query methods, which are not feasible with today's technology. Robots, smart phones, and other devices able to recognize sounds will become aware of physical events in their surroundings. Intelligent monitoring systems will be able to automatically detect events of interest (danger, crime, etc.) and classify different sources of noise. Automatic sound recognition will allow new assistive technologies for the hearing impaired, for example by visualizing sounds. Producing descriptions of audio will give new tools for geographical, social, and cultural studies to analyze human and non-human activity in urban and rural areas. Acquiring descriptions of animal sounds provides tools to analyze the biodiversity of an area cost-efficiently.
The main goal of EVERYSOUND is to develop computational methods for automatically producing high-level descriptions of general sounds encountered in everyday soundscapes. The specific objectives of EVERYSOUND, related to the components that contribute to the main goal of the project and the overall framework, are:
O1: Production of a large-scale corpus consisting of audio material and reference descriptions from a large number of everyday contexts, for developing and benchmarking computational everyday soundscape analysis methods.
O2: Development of a taxonomy for sounds in everyday environments.
O3: Development of robust pattern recognition algorithms that allow recognition of multiple co-occurring sounds that may have been distorted by reverberation and channel effects.
O4: Development of contextual models for everyday soundscapes that take into account the relationships between multiple sources, their acoustic characteristics, and the context.

The project produced several datasets for the development of the methods, as well as computational methods based on deep neural networks for sound event detection. It attracted a large number of researchers to the topic by producing public datasets and evaluation benchmarks.
The project developed computational methods that enable detecting a large number of sound events in everyday environments. We developed machine learning methods for learning models that recognize sound events, and collected annotated acoustic data for training the models and evaluating them. We developed taxonomies for characterizing sound events in everyday environments. We developed machine learning methods that enable learning from existing models, as well as methods that actively ask for input from a human annotator in order to improve the accuracy of the models. We also developed contextual models that take into account other sound events. A significant amount of effort was put into publishing open datasets, benchmark tools, and evaluation metrics, and into organizing public evaluation campaigns to attract other researchers to work on the topic.
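To make the evaluation side concrete, the sketch below shows the kind of segment-based F-score commonly used to benchmark sound event detection systems. It is an illustrative implementation under assumed binary activity matrices, not the project's published evaluation toolbox.

```python
import numpy as np

def segment_based_f1(reference, estimated):
    """Segment-based F1 score for multi-label sound event detection.

    `reference` and `estimated` are binary activity matrices of shape
    (n_segments, n_classes), where entry [t, c] = 1 means class c is
    active in time segment t.
    """
    tp = np.sum((reference == 1) & (estimated == 1))  # correctly detected activities
    fp = np.sum((reference == 0) & (estimated == 1))  # false alarms
    fn = np.sum((reference == 1) & (estimated == 0))  # missed activities
    precision = tp / (tp + fp + 1e-12)
    recall = tp / (tp + fn + 1e-12)
    return 2 * precision * recall / (precision + recall + 1e-12)
```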

WP1 focused on collecting data for the development and evaluation of the methods. Audio data was produced by making real recordings, simulating data, acquiring data from existing sound libraries, and combining real and simulated material. Audio annotations were produced by trained annotators and through crowdsourcing.
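As an illustration of how simulated mixtures can be created from real material, the sketch below inserts an isolated event recording into a background recording at a chosen event-to-background ratio. The function and parameter names are assumptions for illustration, not the project's exact synthesis procedure.

```python
import numpy as np

def mix_event_into_background(background, event, onset_sample, ebr_db):
    """Insert an isolated sound event into a background recording at a
    given onset and event-to-background ratio (EBR) in dB."""
    mixture = background.copy()
    # Scale the event so its RMS relative to the background matches the target EBR.
    bg_rms = np.sqrt(np.mean(background ** 2) + 1e-12)
    ev_rms = np.sqrt(np.mean(event ** 2) + 1e-12)
    gain = (bg_rms / ev_rms) * 10 ** (ebr_db / 20.0)
    segment = mixture[onset_sample:onset_sample + len(event)]
    segment += gain * event[:len(segment)]  # in-place add onto the mixture slice
    return mixture
```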

WP2 developed a multilayer taxonomy for sound events. Sound events were annotated such that each label describes the sound as closely as possible in terms of its production, using a noun to indicate the object or being that produces the sound and a verb to indicate the action that produces it.
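For illustration, labels of this kind can be represented as noun-verb pairs; the examples below are hypothetical and are not entries from the project's actual taxonomy.

```python
# Illustrative sound event labels as (noun, verb) pairs.
event_labels = [
    ("car", "passing by"),
    ("dog", "barking"),
    ("door", "slamming"),
    ("people", "talking"),
    ("water", "running"),
]
```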

WP3 developed pattern recognition methods for detecting a large number of overlapping sound events in realistic soundscapes. We developed a Convolutional Recurrent Neural Network (CRNN) based method for sound event detection and obtained state-of-the-art results with it on several datasets. We also developed robust training procedures for heterogeneous data sources, investigating various feature normalization techniques to handle mismatches between datasets. We also investigated leveraging information from existing models by transfer learning.
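A minimal sketch of a CRNN of the kind described above is shown below in PyTorch: convolutional layers extract features from a log-mel spectrogram, a recurrent layer models temporal context, and a per-frame sigmoid layer outputs class activities. The layer sizes, pooling, and input features are assumptions for illustration and differ from the project's published configurations.

```python
import torch
import torch.nn as nn

class CRNN(nn.Module):
    """Minimal convolutional recurrent network for polyphonic sound event detection."""

    def __init__(self, n_mels=40, n_classes=10, cnn_channels=64, rnn_hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, cnn_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(cnn_channels),
            nn.ReLU(),
            nn.MaxPool2d((1, 5)),          # pool only along frequency
            nn.Conv2d(cnn_channels, cnn_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(cnn_channels),
            nn.ReLU(),
            nn.MaxPool2d((1, 4)),
        )
        freq_after_pool = n_mels // 5 // 4  # 40 -> 8 -> 2 frequency bins
        self.rnn = nn.GRU(cnn_channels * freq_after_pool, rnn_hidden,
                          batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * rnn_hidden, n_classes)

    def forward(self, x):
        # x: (batch, time, n_mels) log-mel spectrogram
        x = x.unsqueeze(1)                    # -> (batch, 1, time, n_mels)
        x = self.cnn(x)                       # -> (batch, channels, time, freq)
        x = x.permute(0, 2, 1, 3).flatten(2)  # -> (batch, time, channels * freq)
        x, _ = self.rnn(x)
        return torch.sigmoid(self.classifier(x))  # frame-wise class activities
```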

WP4 developed contextual models for everyday sounds. The main focus of the WP was on language models for sound event detection, which calculate the probability of a sound event occurring given the preceding events. The WP also developed novel techniques that estimate the spatial location of a sound event, in addition to its temporal activity and class label.
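To illustrate the idea of an event-level language model, the sketch below estimates the probability of a sound event given the preceding event from annotated sequences, using a simple smoothed bigram model. This is a toy formulation standing in for the project's contextual models, and the example sequences are hypothetical.

```python
from collections import Counter, defaultdict

def train_bigram_event_model(event_sequences, alpha=1.0):
    """Estimate P(event | previous event) with add-alpha smoothing."""
    vocab = sorted({e for seq in event_sequences for e in seq})
    counts = defaultdict(Counter)
    for seq in event_sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            counts[prev][cur] += 1
    probs = {}
    for prev in vocab:
        total = sum(counts[prev].values()) + alpha * len(vocab)
        probs[prev] = {cur: (counts[prev][cur] + alpha) / total for cur in vocab}
    return probs

# Example: probability of "dog barking" following "car passing by".
sequences = [["footsteps", "door slamming", "car passing by", "dog barking"],
             ["car passing by", "dog barking", "footsteps"]]
model = train_bigram_event_model(sequences)
print(model["car passing by"]["dog barking"])
```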

Project results were disseminated through more than 50 peer-reviewed scientific papers published in international journals and conferences, as well as in conference presentations. A tutorial and keynote talks about the project's methods were also presented. The results were further disseminated in a national television broadcast and in a newspaper article.
The project produced several novel contributions going beyond the state of the art. These include:
- Development of convolutional recurrent neural networks for the detection of a large number of sound events in realistic environments
- A taxonomy of sound events based on noun and verb pairs
- Several datasets for sound event detection, acoustic scene classification, joint localization and detection of sound events, and audio captioning

Additionally, we addressed several novel problems:
1) Characterizing soundscapes in terms of textual captions, which are complete sentences describing their contents.

2) Use of multichannel audio for joint detection and localization of sound events. The use of multichannel audio can lead to improved sound event detection accuracy in comparison to single-channel techniques, as well as provide location information. We also investigated tracking of sound events that move in space.
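A sketch of how joint detection and localization outputs can be structured is given below: shared frame-wise features feed a detection head and a direction-of-arrival regression head. The architecture, names, and dimensions are assumptions for illustration, not the project's published models.

```python
import torch
import torch.nn as nn

class SELDHeads(nn.Module):
    """Illustrative output heads for joint sound event detection and localization:
    one head predicts per-frame class activities, the other regresses a
    direction-of-arrival vector (x, y, z on the unit sphere) per class."""

    def __init__(self, feature_dim=128, n_classes=10):
        super().__init__()
        self.detection = nn.Linear(feature_dim, n_classes)         # activity per class
        self.localization = nn.Linear(feature_dim, 3 * n_classes)  # DOA per class

    def forward(self, features):
        # features: (batch, time, feature_dim), e.g. from a CRNN over multichannel input
        activity = torch.sigmoid(self.detection(features))
        doa = torch.tanh(self.localization(features))
        doa = doa.view(features.shape[0], features.shape[1], -1, 3)  # (batch, time, classes, 3)
        return activity, doa
```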

3) Interactive methods for annotating sound event databases and for active learning of sound event models. Interactive methods allow faster annotation of sound event databases, which reduces the cost of database collection. We developed novel active learning methods that first analyze an initially unlabeled audio dataset and then select sound segments from it for manual annotation.
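As an illustration of the selection step, the sketch below ranks segments by prediction uncertainty and returns the most ambiguous ones for manual annotation. This simple uncertainty-sampling criterion is an assumed stand-in for the project's more refined active learning strategies.

```python
import numpy as np

def select_segments_for_annotation(probabilities, budget):
    """Pick the segments whose most ambiguous class prediction is closest to 0.5.

    probabilities: (n_segments, n_classes) predicted class probabilities.
    budget: number of segments to send to the human annotator.
    """
    # Distance of the most ambiguous class from 0.5, rescaled so 1.0 = maximally uncertain.
    uncertainty = 1.0 - 2.0 * np.abs(probabilities - 0.5).min(axis=1)
    return np.argsort(-uncertainty)[:budget]
```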

4) Zero-shot learning of acoustic models, using the textual labels of the classes to predict the model. Using class labels enables learning acoustic models without any audio data collection or annotation, which makes it possible to address the recognition of new types of events efficiently.
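The sketch below illustrates the general zero-shot idea: textual label embeddings are projected into the acoustic embedding space and compared to a clip embedding by cosine similarity. The names, shapes, and learned projection matrix are assumptions for illustration, not the project's exact formulation.

```python
import numpy as np

def zero_shot_scores(audio_embedding, label_embeddings, projection):
    """Score unseen classes for an audio clip using only textual label embeddings.

    audio_embedding: (audio_dim,) embedding of the clip.
    label_embeddings: (n_classes, text_dim) embeddings of the class labels.
    projection: (text_dim, audio_dim) learned mapping from text to audio space.
    """
    projected = label_embeddings @ projection  # map labels into the acoustic space
    projected /= np.linalg.norm(projected, axis=1, keepdims=True) + 1e-12
    audio = audio_embedding / (np.linalg.norm(audio_embedding) + 1e-12)
    return projected @ audio                   # cosine similarity per class
```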
EVERYSOUND developed computational methods that estimate the sound events present in an audio signal.