
Individual Three-dimensional Spatial Auditory Displays for Immersive Virtual Environments

Periodic Reporting for period 1 - ITS A DIVE (Individual Three-dimensional Spatial Auditory Displays for Immersive Virtual Environments)

Reporting period: 2019-01-01 to 2020-12-31

3D spatial auditory displays can provide accurate information about the relation between a sound source and the surrounding environment, including the listener, whose body acts as an additional acoustic filter. This information cannot be conveyed by any other modality (e.g. visual or tactile). Nevertheless, today's spatial representation of audio tends to be simplistic: current multimodal systems typically integrate only simple stereo or surround sound.
In ITS A DIVE, highly innovative techniques for binaural sound rendering have been developed, following a multidisciplinary approach encompassing different research areas such as computer science, acoustics, and psychology. The focus of the research program has been on structural modeling of head-related transfer functions (HRTFs), i.e. a family of state-of-the-art modeling techniques that overcome the current limitations of headphone-based 3D audio systems. Customizing the HRTF model to the user's anthropometry gives any user low-cost, real-time access to realistic individual 3D audio, previously only possible with expensive equipment and invasive recording procedures.
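To illustrate the structural approach, one classic component of such models is the first-order head-shadow filter of Brown and Duda's structural HRTF model, which approximates the head as a rigid sphere. The sketch below evaluates its analog frequency response; the parameter values (head radius, α range) are the textbook ones and purely illustrative, not the project's actual implementation:

```python
import numpy as np

def head_shadow(f, theta_deg, a=0.0875, c=343.0):
    """Brown-Duda first-order head-shadow response at frequency f (Hz)
    for incidence angle theta_deg (0 = ipsilateral ear, increasing
    toward the contralateral side). a: head radius (m), c: speed of
    sound (m/s). Illustrative textbook values, not project code."""
    w = 2 * np.pi * np.asarray(f, dtype=float)
    w0 = c / a  # characteristic frequency of the spherical head
    # alpha varies from 2 (high-frequency boost, ipsilateral)
    # down to alpha_min = 0.1 (shadowing, contralateral)
    alpha = 1.05 + 0.95 * np.cos(np.radians(theta_deg / 150.0 * 180.0))
    return (1 + 1j * alpha * w / (2 * w0)) / (1 + 1j * w / (2 * w0))
```

The filter is flat (unity gain) at DC for every angle, boosts high frequencies on the near side, and attenuates them on the far side, mimicking head shadowing with a single tunable parameter per direction.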
The main objective of the research program has been the definition and experimental validation (through subjective tests in a 3D environment) of a completely customizable structural model for binaural sound presentation, which was still missing in the literature on spatial audio. The technical focus has been on the exploitation of a vast number of public HRTF databases, including custom controlled acoustical measurements, and of state-of-the-art machine learning techniques in order to customize HRTFs by incorporating prior knowledge on the relation between HRTF features and anthropometry.
The project has achieved most of its objectives and milestones, with relatively minor deviations.
The ITS A DIVE research methodology proceeded in three phases: acquisition, modelling, and evaluation.
In the acquisition phase, a large number of public HRTF databases from research labs worldwide have been collected and merged into a single large set of acoustic measurements (>400 human subjects). In addition to the organization and use of these public datasets, most resources in this phase have been allocated to the collection of a new dataset of custom acoustic measurements, named the Viking HRTF dataset, in collaboration with the University of Iceland. This dataset includes full-sphere HRTFs measured on a dense spatial grid with a binaural mannequin fitted with different artificial pinnae. Anthropometric data have been either collected from pre-processing of public databases or obtained with new measurements on 2D or 3D anatomical data (e.g. ear pictures, head meshes). In particular, new features related to the shape of the ear have been automatically extracted from 3D head/ear meshes, such as depth maps; edge maps, i.e. 2D representations of the most prominent pinna edges; and reflection maps, i.e. selections of mesh points that theoretically produce reflections towards the ear canal entrance.
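As an illustration of how such image-like features can be derived from a mesh, the sketch below rasterizes a coarse depth map from an ear point cloud by keeping, per pixel, the sample with the largest depth along the viewing axis. This is a hypothetical numpy-only sketch, not the project's actual extraction pipeline:

```python
import numpy as np

def depth_map(vertices, res=16):
    """Rasterize a (res x res) depth map from a point cloud.

    vertices: (N, 3) array; x and y span the image plane (e.g. the
    plane facing the ear), z is depth toward the viewer. Per pixel we
    keep the largest-z sample, i.e. the point nearest the viewer.
    Empty pixels stay NaN. Illustrative sketch only.
    """
    x, y, z = vertices.T
    xi = np.clip(((x - x.min()) / (np.ptp(x) + 1e-9) * res).astype(int), 0, res - 1)
    yi = np.clip(((y - y.min()) / (np.ptp(y) + 1e-9) * res).astype(int), 0, res - 1)
    dm = np.full((res, res), np.nan)
    for i, j, d in zip(yi, xi, z):
        if np.isnan(dm[i, j]) or d > dm[i, j]:
            dm[i, j] = d
    return dm
```

Edge maps could then be obtained by thresholding the gradient of such a depth map, and reflection maps by ray-testing mesh points against an assumed ear-canal entrance position.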
The modelling phase has focused on a blend of traditional signal processing techniques, state-of-the-art machine learning algorithms tuned to both global and local characteristics of HRTFs, and physically inspired models of sound propagation around the ear. Each structural component has been analyzed through ad-hoc signal processing algorithms; this has been possible because some of the collected HRTF databases contain partial responses of head-only or earless mannequins. Then, since HRTFs are by nature high-dimensional data with a wide range of predictors, adequate dimensionality reduction and/or feature extraction techniques have been applied to partial HRTF data in order to obtain compact representations to be correlated with anthropometric data. Finally, the most adequate machine learning techniques, including state-of-the-art deep learning algorithms, have been applied to yield the model that best meets speed, interpretability, and accuracy requirements. This procedure has allowed the design of a complete structural HRTF model combining measured, synthesized, and selected components.
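A minimal sketch of the reduction-plus-learning step described above, using PCA for dimensionality reduction and ridge regression to map anthropometry to the compact representation. The project employed more sophisticated (including deep) models; all data here are synthetic stand-ins and all sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
n_subj, n_freq, n_anthro, k = 60, 128, 5, 8

# Synthetic stand-ins: anthropometric vectors A and per-subject
# log-magnitude HRTF spectra H (linear in A plus measurement noise).
A = rng.normal(size=(n_subj, n_anthro))
H = A @ rng.normal(size=(n_anthro, n_freq)) + 0.05 * rng.normal(size=(n_subj, n_freq))

# 1) Dimensionality reduction: PCA of the centered spectra via SVD.
mu = H.mean(axis=0)
_, _, Vt = np.linalg.svd(H - mu, full_matrices=False)
W = (H - mu) @ Vt[:k].T          # compact per-subject weight vectors

# 2) Learning: ridge regression from anthropometry to PCA weights.
lam = 1e-2
B = np.linalg.solve(A.T @ A + lam * np.eye(n_anthro), A.T @ W)

# 3) Customization: spectra predicted from anthropometry alone.
H_hat = (A @ B) @ Vt[:k] + mu
rel_err = np.mean((H_hat - H) ** 2) / np.mean((H - mu) ** 2)
```

On this synthetic data the relative reconstruction error is small because the spectra are nearly linear in the anthropometric predictors; real HRTF data are far less linear, which is what motivates the nonlinear and deep models mentioned above.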
In the evaluation phase, signal-related error metrics and auditory models have been developed to compare the customized HRTFs produced by the structural model against the original measured HRTFs of a number of database subjects. Indeed, a good objective correspondence between the two sets is a prerequisite for performing subjective tests. The HRTF models have then been integrated in a 3D game in order to run individual tests with dynamic rendering of virtual sound sources. Metrics collected from the user tests include, among others, localization error, degree of externalization, and an extensive user questionnaire. The final results showed that, overall, participants performed best in the localization task with their individualized HRTF, with increased accuracy with respect to generic HRTFs, especially in the case of expert game players.
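Two objective metrics commonly used for this kind of comparison, log-spectral distortion between HRTF magnitude responses and great-circle localization error between target and response directions, can be sketched as follows (illustrative definitions, not necessarily the exact metrics used in the project):

```python
import numpy as np

def spectral_distortion(H_ref, H_test, eps=1e-12):
    """RMS log-spectral distortion (dB) between two magnitude responses
    sampled on the same frequency grid."""
    d = 20 * np.log10((np.abs(H_test) + eps) / (np.abs(H_ref) + eps))
    return np.sqrt(np.mean(d ** 2))

def angular_error(az1, el1, az2, el2):
    """Great-circle angle (degrees) between two directions given as
    (azimuth, elevation) pairs in degrees."""
    a1, e1, a2, e2 = map(np.radians, (az1, el1, az2, el2))
    c = np.sin(e1) * np.sin(e2) + np.cos(e1) * np.cos(e2) * np.cos(a1 - a2)
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))
```

Averaging `angular_error` over trials yields a per-subject localization error, which can then be compared between individualized and generic HRTF conditions.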
Exploitation of the action's results and dissemination to the scientific community have occurred through the release of a new database of acoustical measurements, the release of the source code for the structural HRTF model, and the publication of high-quality scientific papers in international peer-reviewed journals and conferences. Both the publications and the code/databases are expected to become solid references for researchers in 3D audio and, more generally, in the European Sound and Music Computing (SMC) community. Their non-commercial exploitation is expected to lead to further research in this field and to further strengthen the position of both the researcher and the host in these research communities.
Prior to this project, no HRTF model for customized binaural audio delivery that goes beyond the exclusive use of pre-recorded or numerically simulated HRTFs had been successfully proposed and evaluated in the literature. By developing and evaluating a completely customizable structural HRTF model, ITS A DIVE successfully filled this gap.
Regarding the technological outcomes of the project, realistic virtual auditory displays represent an innovative breakthrough for a number of additional application areas not envisaged in this project. Examples of potential applications include personal cinema, teleconferencing and teleoperation systems, travel aids for the visually impaired, and computer game engines. In particular, spatial sound technologies are expected to become increasingly common in computer games. Furthermore, the techniques developed in ITS A DIVE have minimal hardware requirements with respect to other technologies for immersive sound reproduction (e.g. wave-field synthesis), as well as low computational requirements at the terminal side. These remarks are particularly relevant for mobile applications.