Sounds carry a large amount of information about our everyday environments and the physical events that take place in them. For example, when a car is passing by, one can perceive the approximate size and speed of the car. Similar information can be obtained from many other sound sources such as humans, animals, etc. Sound can be captured easily and non-intrusively by cheap recording devices and transmitted further – for example, tens of hours of audio are uploaded to the internet every minute, e.g. in the form of YouTube videos. Extracting information from everyday sounds is easy for humans, but today’s computational audio analysis algorithms are not able to recognize the individual sounds within them.
Computational analysis of everyday soundscapes will open up vast possibilities to utilize the information encoded in sound. Automatically acquired high-level descriptions of audio will enable the development of content-based multimedia query methods, which are not feasible with today's technology. Robots, smartphones, and other devices able to recognize sounds will become aware of physical events in their surroundings. Intelligent monitoring systems will be able to automatically detect events of interest (danger, crime, etc.) and classify different sources of noise. Automatic sound recognition will allow new assistive technologies for the hearing impaired, for example by visualizing sounds. Producing descriptions of audio will give new tools for geographical, social, and cultural studies to analyze human and non-human activity in urban and rural areas. Acquiring descriptions of animal sounds will give tools to analyze the biodiversity of an area cost-efficiently.
The main goal of EVERYSOUND is to develop computational methods for automatically producing high-level descriptions of general sounds encountered in everyday soundscapes. The specific objectives of EVERYSOUND, each related to a component that contributes to the main goal of the project and to the overall framework, are:
O1: Production of a large-scale corpus consisting of audio material and reference descriptions from a large number of everyday contexts, for developing and benchmarking computational methods for everyday soundscape analysis.
O2: Development of a taxonomy for sounds in everyday environments.
O3: Development of robust pattern recognition algorithms that allow recognition of multiple co-occurring sounds that may have been distorted by reverberation and channel effects.
O4: Development of contextual models for everyday soundscapes that will take into account relationships between multiple sources, their acoustic characteristics, and the context.
The project produced several datasets for method development, as well as computational methods based on deep neural networks for sound event detection. By producing public datasets and evaluation benchmarks, it attracted a larger number of researchers to work on the topic.