Community Research and Development Information Service - CORDIS


Sound of Vision Report Summary

Project ID: 643636
Funded under: H2020-EU.3.1.

Periodic Reporting for period 1 - Sound of Vision (Natural sense of vision through acoustics and haptics)

Reporting period: 2015-01-01 to 2016-06-30

Summary of the context and overall objectives of the project

Sound of Vision (vision restoration through sound and haptics) will design, implement and validate an original non-invasive hardware and software system to assist visually impaired persons in understanding and navigating their environment. The Sound of Vision solution will do this by creating and conveying auditory and haptic representations of the surrounding environment. Those representations will be created, updated and delivered to a blind person continuously and in real time.

The overall objective of the project is to develop a system that will help visually impaired persons both in perceiving the environment and in independently moving in indoor or outdoor areas, without the need for predefined tags/sensors located in the surroundings.

The process proposed by Sound of Vision consists of a series of repetitive steps. The first step generates a 3D model of the surrounding environment from the video captured in real time by the Sound of Vision device camera. In the second step, the objects from the 3D model are transformed into 3D sound and haptic sources. In the last step, the 3D sound and haptic sources are combined and conveyed to the user through specialized wearable hardware and algorithms.

During the project, three versions of functional prototypes will be developed, each of which will be extensively tested and validated. Each new version of the prototype will be improved based on the experience with the last prototype – i.e. using test results and feedback provided by visually impaired persons, training specialists and neurologists.

The end result – the Sound of Vision system – will be sold as a hardware and software solution and will include training courses to help visually impaired persons use the system. The consortium has the necessary competencies to design and develop the proposed system, test the prototype, establish training protocols and commercialize the system as an intuitive and accessible final product.

Sound of Vision is a concept that goes beyond the state of the art of visual sensory substitution systems and has the potential to become an affordable commercial product that will actually help visually impaired persons. This system can have an impressive social impact, improving the lifestyle of visually impaired persons and reducing their dependence on their families and friends.

Work performed from the beginning of the project to the end of the period covered by the report and main results achieved so far

The first work package (WP1) was dedicated to the preparation of the development of the Sound of Vision system. The first consortium workshop took place in Barcelona in February 2015. At the workshop, the partners discussed and established specific tasks, such as management, inter-communication and reporting tools and procedures. The first draft of the User Requirements Document (URD) was presented to the Consortium (by UPB) and discussed by the partners.

During WP1, the consortium investigated and analysed the latest concepts and results of research relevant to Sound of Vision. A further refinement of the system requirements was performed using the results of questionnaires and interviews with end users. This work resulted in the URD. Beginning with the URD, the consortium established the design goals for the Sound of Vision system, including all the aspects related to performance, dependability, costs, usability and modularity for distributed development. Guided by these, the design work progressed iteratively and produced a high-level guide and blueprint for the implementation of the system, described in the Architectural Design Document (ADD).

A comprehensive cost-performance analysis of existing devices and equipment that could be used to implement the SOV system was carried out during Months 3–4 of the project, with contributions from all partners. The analysis resulted in a list of equipment to be purchased and evaluated by the consortium in work package 2 (WP2).
Dissemination activities were also initiated within the project. The online presence of the project was established, i.e. website and social media (Twitter and Facebook). The studies performed by the consortium in WP1 were described in scientific survey articles, which were submitted to relevant journals and conferences.

WP2 was dedicated to experimenting with and evaluating technical alternatives for the Sound of Vision system. Teams consisting of consortium members with relevant expertise were formed, focusing on the following technical fields related to the development of the SOV system: 3D image acquisition, 3D image processing, audio models, specialized prototyped headphones, haptics and virtual training/testing environments. Furthermore, the following preliminary tests and experiments with end users were performed: headphone testing and comparison, sound model testing, outdoor mobility study, EEG and physiological signal studies.

As the project pays exceptional attention to user training aspects, the design concept of the virtual training environment was also defined to a good level of detail during WP2.
Standard testing procedures for testing and evaluating the Sound of Vision prototypes were devised in WP2. These procedures included the paths and places required for the tests. The development of the standard testing procedures was an important goal of WP2, as it was crucial to the processing and interpretation of the test results of the Sound of Vision system.
The blueprint for the implementation of the Sound of Vision solution was designed and written in a Detailed Design Document (DDD). The purpose of the DDD is to guide the prototyping work in work packages WP3–WP5. The DDD was based on the URD, ADD and the results of all the WP2 experiments. However, the DDD can be modified and improved as the project advances and as the partners learn more from testing and using the prototypes.

The dissemination of concepts and results follows a general communication plan, which was established in WP1. The partners have submitted and published many conference and journal papers. All the papers are listed on the project's website. Furthermore, the online social media has also been updated regularly with news about the partners' activities within the project.
The main focus in the 3rd work package (WP3) was the development of the first prototype of the Sound of Vision system. Considering the complexity of the project and its highly innovative nature, an agile-type methodology (Scrum) was adopted in order to make the most efficient use of the available resources. A consortium meeting was held in Budapest, where the results obtained in WP2 were reviewed and used for organizing and synchronizing the Agile work in WP3.

The development work in WP3 was interspersed with further assessment of the latest technology (devices and algorithms) available after the initial evaluation performed in WP2.
In June 2016 (Month 18 of the project), the Consortium produced five units of the first version of the first prototype of the Sound of Vision device. It consists of a head gear, a haptic belt and a processing unit.

The head gear includes two 3D cameras, one for outdoor (stereo) and one for indoor (structured light), a specially prototyped multi-speaker system (that enables 3D sound localisation) and an inertial measurement unit (IMU). The multi-speaker system was designed to facilitate experiments for determining the optimum number and placement of the speakers for best sound source localisation.

The haptic belt consists of two 6x10 arrays (one located on the back and the other, made from two 6x5 arrays, located on the front) of vibrating actuators that convey the environmental information to the user. The belt enables the system to encode the information in an expressive and flexible way, e.g. one array can be used for conveying object properties, while the other can deliver spatial information at the same time. Moreover, the flexible design of this haptic system enables the Consortium to experiment with various innovative haptic languages.
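As an illustration of how one such 6x10 array could be addressed, the sketch below maps the direction of an object to a single actuator. The report does not describe the actual haptic encoding, so the function names (`actuator_for_direction`, `flat_index`), the field-of-view values and the linear angle-to-cell mapping are all assumptions made for this example.

```python
ROWS, COLS = 6, 10  # one 6x10 actuator array, as described in the report


def actuator_for_direction(azimuth_deg, elevation_deg,
                           h_fov_deg=90.0, v_fov_deg=60.0):
    """Return (row, col) of the actuator closest to a direction.

    Assumed convention: azimuth 0 is straight ahead and negative is left;
    elevation 0 is the horizon and positive is up. Row 0 is the top row.
    The field-of-view defaults are illustrative, not taken from the report.
    """
    # Normalise each angle to [0, 1] across the assumed field of view.
    u = (azimuth_deg + h_fov_deg / 2) / h_fov_deg
    v = (v_fov_deg / 2 - elevation_deg) / v_fov_deg
    # Clamp so that directions outside the field of view map to the border.
    u = min(max(u, 0.0), 1.0)
    v = min(max(v, 0.0), 1.0)
    col = min(int(u * COLS), COLS - 1)
    row = min(int(v * ROWS), ROWS - 1)
    return row, col


def flat_index(row, col):
    """Index of an actuator in a row-major list of ROWS * COLS drive levels."""
    return row * COLS + col
```

With these assumed defaults, an object straight ahead at horizon level lands near the centre of the grid, and directions outside the field of view are clamped to the border actuators.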

The processing unit consists of a high-end ARM-based portable computer, audio and haptic controllers, batteries and storage medium. Its purpose is to run the Sound of Vision software, which integrates the 3D video acquisition, the video sequence processing and the audio and haptic encoding, delivering real-time output to the user.

At the end of June 2016 (Month 18 of the project), the consortium was putting the finishing touches on the first version of the Virtual Training/Testing Environment. It will be used for initial tests of the Sound of Vision device. The Virtual Environments provide the means to perform tests/training with end users in safe conditions. The Virtual Environments allow safe use of the SOV device during testing by using input from virtual cameras instead of real ones. Besides testing in Virtual Environments, usability tests will also be run in real world environments, according to thoroughly prepared and described testing procedures. The results will be used to further refine the first SOV prototype by the end of September (end of WP3).

Progress beyond the state of the art and expected potential impact (including the socio-economic impact and the wider societal implications of the project so far)

The large number of persons affected by visual impairment has driven researchers worldwide to develop information systems that assist these people in accomplishing different activities. Attempts to replace vision using the haptic and auditory senses indicate the significance and potential of these approaches. There are systems that help visually impaired persons use computers and public transportation or navigate in known environments (for example, museums or university areas covered with sensors or tags). However, very few systems provide a high degree of independence, allowing visually impaired persons to achieve meaningful mobility and take part in active life. In the EU and throughout the world, many researchers believe that it is actually possible to develop such solutions, backed by technological advances in computing, sensors and neurosciences. Still, the difficulty of the task, which requires a great deal of advanced niche expertise, as well as the lack of synergies, has prevented the creation of sufficiently powerful, easy-to-use and affordable solutions.

The overall concept and design of the Sound of Vision system are tailored to overcome the limitations of previous approaches. The development of the Sound of Vision system is based on the following key concepts, which render it an overall progress beyond the state of the art in the design of assistive technologies for the blind:
- Multi- and inter-disciplinary approach: This allows us to incorporate the latest technological advances in computing, sensors and neuro and behavioural science.
- Emphasis on heavily exploiting end users' feedback: The end users are directly involved in the system design and testing phases. Their involvement is achieved through direct participation, as consortium members, of two organizations for the visually impaired, caretakers of visually impaired people in the teams of several partners, as well as large volunteer participation. This has allowed the consortium to carry out a large survey with visually impaired persons in four European countries. The conclusions of this study have been incorporated in the specifications of user requirements (D1.1), the system design (D1.3 and D2.9) as well as the definition of the testing and training procedures (D2.6, D2.8). Moreover, all these specifications have been devised and reviewed with the participation of end users and caretaker members of the consortium. Testing with end users is extensively exploited to improve the design and implementation of the incremental Sound of Vision prototypes. Testing scenarios are carefully designed by neuro and behavioural scientists, together with technical specialists, to fully exploit user feedback.
- Rich data acquisition: The Sound of Vision system incorporates multiple environment sensing technologies (stereo, structured light, inertial) to compute a full 3D reconstruction of the environment in many possible situations (indoor, outdoor, various lighting conditions, etc.).
- Rich, natural-like perception: The Sound of Vision system provides a continuous real-time reconstruction and rendering of the environment. The use of two output modalities, audio and haptic, allows the system to provide a natural perception of the environment, similar to the real visual sense. New sonification and haptic encoding models to deliver the environment information are being developed and tested.
- Focus on training: Sound of Vision regards training as the unique key for users to unlock the full potential of the device and achieve full proficiency. Gamified virtual training is the most important component, providing high availability, cost effectiveness, motivation and safety for our end users. The training mode of the device will provide a collection of serious games based on virtual environments of gradually increasing complexity. The design of these serious games was a highly innovative and creative activity, and its results – described in detail in D2.5 – represent a clear progress beyond the state of the art in the areas of serious games as well as training for/with assistive or rehabilitative solutions.
The progress beyond the state of the art can be grouped into the following parts:

I. Image-acquisition and 3D reconstruction of the environment
3D reconstruction of the environment geometry in real time is one of the most challenging tasks in the development of the Sound of Vision system, due to the wearability constraints imposed on the hardware devices that can be used for image acquisition and processing. This has driven the Sound of Vision team to focus their research and development efforts on designing and implementing light-weight and efficient algorithms that can still produce a rich 3D representation of the environment.

This complex task has been tackled using two separate approaches: one based on stereo vision for outdoor environments and the other based on structured light sensing for indoor environments. The development of these solutions has led to several contributions beyond the state of the art in computer vision and image understanding:

I.1 Contributions in 3D reconstruction from stereo sequences
o Ground plane detection in stereo sequences: One key element that needs to be identified in the surrounding environment of a visually impaired user is the ground plane. Its accurate detection is highly important because the user can move freely in that area. In contrast with automotive or robotics applications, which also heavily rely on ground plane extraction for obstacle detection, the camera orientation has more degrees of freedom. This leads to the requirement for designing more complex solutions for ground plane detection in the case of assistive systems for the visually impaired. The Sound of Vision team has designed and developed two solutions to tackle this problem:
   - Ground plane detection with camera roll rotation detection and compensation. This method is able to compensate for roll rotation of the stereo-vision camera based only on visual data (no external IMU required). It is particularly useful in mobile applications of computer vision systems (e.g. in humanoid robots or travel aids for the visually impaired), where camera movements may affect the result of the 3D reconstruction and detection of obstacles.
   - Ground plane detection with camera orientation (pitch and roll) provided by an IMU device. The proposed algorithm is based on an efficient processing of the v-disparity map associated with each frame in the stereo sequence and on a two-step decision-making approach to determine the most suited image area that corresponds to the ground plane.
o Use of multiple representations of the disparity map for obstacle detection in stereo sequences: Some research performed as part of this project aimed at providing an application-independent framework for obstacle detection based on disparity map processing. In order to identify regions in the 3D environment that can be classified as obstacles, we employ multiple representations of the disparity map: v-disparity, u-disparity and theta-disparity. We produced a comprehensive overview of these representations and their use in obstacle detection algorithms. We developed a framework for evaluating these representations in the context of automotive applications and assistive applications for the visually impaired, using data from both real and virtual environments. The framework is application independent and is suited to real-time processing requirements. This is a consequence of approaching the obstacle detection problem in 2D space instead of processing the 3D point cloud associated with each frame in the sequence. The v-disparity maps are processed for ground plane extraction, while the regions in the image corresponding to obstacles are detected in the u-disparity and theta-disparity representations together. The results extracted from the theta-disparity image efficiently complement the u-disparity segmentation, as the union or intersection of the two outputs can provide more accurate descriptions of the obstacles, depending on the application. Moreover, the segmentation in the theta-disparity map reveals the radial disposition of the obstacles with respect to the camera/user as well as the free navigable directions in a straightforward way.
o Dense disparity map computation on ARM-based processors: The computation of dense disparity/depth maps from stereo image pairs has traditionally been a computationally intensive task. A promising approach for solving the stereo correspondence problem in the context of the Sound of Vision project is the ELAS algorithm. It offers the most convenient compromise between depth estimation accuracy and computational speed. The original algorithm is implemented in C++ and is strongly optimized for single-CPU use. Since computational speed is an important aspect, the ELAS algorithm was optimized to maximize speed while producing the same results when run on the target computational platform, i.e. a multi-core ARM processor. Due to the CPU-oriented nature of the algorithm, OpenMP is used to run multiple steps of the original, sequential algorithm in parallel on multiple CPU hardware threads. Furthermore, the original SSE optimization specific to x86 processors was replaced with NEON SIMD instructions for efficient exploitation of the target ARM processor. The experiments performed with NVIDIA Jetson TK1 and TX1 computing platforms revealed a speedup of up to 4x for the proposed implementation.
o Dual-vectors-based solution for the initial guess in the motion estimation problem: The motion estimation problem is very important in multiple fields such as robotics, machine vision and automotive or assistive technologies. In the context of visual odometry techniques, the problem is to find the optimal transformation that can align two sets of 3D points. Although various minimization approaches have been proposed in recent years, these algorithms may get trapped in local minima due to the non-convexity of the problem. This issue may be overcome if the initial guess is chosen as close as possible to the true solution. A dual-vectors-based method is proposed for choosing the initial guess in the minimization procedure. This new approach is strongly connected with the parameters that can be used to describe the displacement of rigid bodies. Using an isomorphism between the special Euclidean group SE(3) and the group of orthogonal dual tensors (the dual-number analogue of SO(3)), a procedure was developed to compute the initial guess for such problems. The proposed method is evaluated in a dedicated framework that implements the Iterative Closest Point algorithm.
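To make the v-disparity and u-disparity representations mentioned in this list more concrete, the following minimal sketch builds both histograms from a small integer disparity map. It is not the consortium's implementation: the function names, the use of plain Python lists and the convention that a disparity of 0 means "no valid measurement" are assumptions made for this example, and the theta-disparity representation is not covered.

```python
def v_disparity(disparity, max_disp):
    """Row-wise histogram: out[v][d] counts pixels in image row v with disparity d."""
    out = [[0] * (max_disp + 1) for _ in disparity]
    for v, row in enumerate(disparity):
        for d in row:
            if 0 < d <= max_disp:  # 0 is treated as "no valid disparity" (assumption)
                out[v][d] += 1
    return out


def u_disparity(disparity, max_disp):
    """Column-wise histogram: out[d][u] counts pixels in image column u with disparity d."""
    width = len(disparity[0])
    out = [[0] * width for _ in range(max_disp + 1)]
    for row in disparity:
        for u, d in enumerate(row):
            if 0 < d <= max_disp:
                out[d][u] += 1
    return out


# Toy scene: the ground disparity grows with the row index, and an obstacle
# with constant disparity 4 covers columns 2-3 in rows 1-2.
toy = [
    [1, 1, 1, 1, 1],
    [2, 2, 4, 4, 2],
    [3, 3, 4, 4, 3],
    [4, 4, 4, 4, 4],
]
v_map = v_disparity(toy, max_disp=4)
u_map = u_disparity(toy, max_disp=4)
```

In the toy scene, the ground shows up in `v_map` as a slanted line (one dominant disparity per row), while the obstacle shows up in `u_map` as a run of high counts at disparity 4 in columns 2 and 3. Working on these 2D histograms rather than on the 3D point cloud is what keeps this family of methods cheap.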

I.2 Contributions in 3D reconstruction from depth information (structured light camera)
o An inpainting algorithm used for completing the absent information from the depth map; the algorithm is useful for preprocessing the images acquired from RGBD cameras.
o An algorithm for computing normals in camera space, based on a sparse sampling.
o New segmentation algorithms. The Sound of Vision team has developed three methods for processing the depth maps obtained from a structured light camera.
- The first is a fast and simple iterative planar segmentation on the CPU which can be used not only for extracting generic objects, but also as a feature extraction step in the camera pose estimation algorithm. The innovative part in this algorithm consists of the 'likeness' cost function for two pixels which are compared and the weights given to the elements in the cost function. This function is computed based on the normal map and the depth map and represents the probability of two neighbouring pixels being part of the same region. Another advancement beyond the state of the art of this segmentation method is the inter-frame consistency, which compares regions from two subsequent frames and tries to pair them up.
- The second is a generic segmentation method based on the OpenCV connected component algorithm, which extracts edges from both normal and depth maps and builds up regions for each frame. This generic segmentation proposes an innovative manner of combining regions from subsequent frames by computing a cost function between two regions, one on the current frame and the second on the previous frame. Thus, the inter-frame consistency is ensured, and the detected objects can be tracked throughout the whole video.
- The third and most important contribution for 3D reconstruction of the environment is an efficient segmentation algorithm, which runs on the GPU and is based only on the depth information produced by the structured light camera. The algorithm starts from a number of seed pixels and discovers the contours of the image regions using the edge information extracted from the depth and normal maps. The contours are discovered by tracing rays from the seed pixels through the image; these rays explore the image, stop when they encounter edges and connect to other rays, building a spider net. Another innovative element of this method is the inter-frame consistency implemented on the GPU. The method also proposes new heuristics for ground and wall detection and for obtaining generic objects from depth map frames. The complexity of the algorithm is O(n), where n is the number of seed pixels, which is equivalent to O(log(max(image width, image height))). The algorithm can process 30 frames/second, i.e. in real time, on any device with the computational power of a modern mobile phone. The algorithm is parallel and scalable. It converges to the exact result with a complexity that is sublinear in the number of image pixels.
o Detection of the ground and any other horizontal surface, based on normals and IMU data.
o Detection of any plane surface, like walls and doors, based on normals.
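The last items in this list (normals computed in camera space, and horizontal surfaces detected from normals plus IMU data) can be illustrated with a short sketch. It is a simplified stand-in rather than the project's algorithm: it assumes a pinhole camera with hypothetical intrinsics `fx`, `fy`, `cx`, `cy`, a camera frame with x to the right, y down and z forward, a gravity vector already expressed in that frame, and a 10-degree tolerance chosen arbitrarily.

```python
import math


def backproject(depth, fx, fy, cx, cy):
    """Convert a depth map (metres) into a grid of 3D points in camera space."""
    return [[((u - cx) * z / fx, (v - cy) * z / fy, z)
             for u, z in enumerate(row)]
            for v, row in enumerate(depth)]


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def _unit(a):
    n = math.sqrt(a[0] ** 2 + a[1] ** 2 + a[2] ** 2)
    return (a[0] / n, a[1] / n, a[2] / n) if n > 0 else None


def normal_at(points, v, u):
    """Unit normal from the cross product of the right and down neighbour offsets.

    Valid for pixels that are not in the last row or last column.
    """
    p = points[v][u]
    return _unit(_cross(_sub(points[v][u + 1], p), _sub(points[v + 1][u], p)))


def is_horizontal(normal, gravity, max_angle_deg=10.0):
    """True if the normal lies within max_angle_deg of the gravity axis."""
    g = _unit(gravity)
    if normal is None or g is None:
        return False
    dot = abs(sum(n * c for n, c in zip(normal, g)))
    return dot >= math.cos(math.radians(max_angle_deg))


# A synthetic floor 1 m below the camera and a wall 2 m in front of it.
fx = fy = 100.0
floor_depth = [[1.0 * fy / (v + 1) for _ in range(3)] for v in range(3)]
floor_normal = normal_at(backproject(floor_depth, fx, fy, 1.0, -1.0), 0, 0)
wall_depth = [[2.0] * 3 for _ in range(3)]
wall_normal = normal_at(backproject(wall_depth, fx, fy, 1.0, 1.0), 0, 0)
gravity_in_camera = (0.0, 1.0, 0.0)  # would come from the IMU in practice
```

Here `is_horizontal(floor_normal, gravity_in_camera)` holds while the wall normal is perpendicular to gravity. The same dot-product test with a different reference direction could be used to flag vertical planes such as walls and doors.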

II. In sound-modelling and 3D sound rendering devices
The Sound Prototyping Tool is an internal tool developed to simplify the process of designing and testing sound models for auditory rendering of 3D environments. The tool allows defining complex audio processing, based on interactive and programmable events, using real-time data about a 3D environment as input. It provides a visual editor and many other facilities for fast prototyping of sonification models.

To the best of the consortium's knowledge, no such tool is available right now. The consortium intends to make the tool available as open source at the end of the project.

In the current state of the art for 3D spatial audio delivery, there are two main streams of applications. In the first, covering cinema and the military field, 3D spatial audio is rendered through a fixed number of loudspeakers, with spatial techniques such as WFS (Wave Field Synthesis), Surround Sound (5.1, 6.1, 7.1, 9.1, 10.2, 11.1, 16.2, 22.2), Ambisonics, VBAP (Vector Base Amplitude Panning) and Ambiophonics. The second covers video games and some virtual and augmented reality applications, where headphones are used together with techniques such as stereo panning and HRTFs (Head-Related Transfer Functions, generic or individualized) to create an auditory spatial perception for the user.

The present project uses a unique combination of a multi-speaker technique (3 or 4 mini-speakers per ear) with spatialization techniques used for binaural hearing, such as HRTFs and panning. This combination enables spatialized acoustic stimuli to be rendered to the user without loss of the sounds naturally present in the environment.

The approach to the image-to-sound mapping is also distinctive. Models for scanning the scene (obstacles, walls, objects in general) have been developed within the consortium. Horizontal sweep, Moving plane and Growing sphere are some of the algorithms used for obtaining cues about the distance and angular position of an object relative to the user, as well as the width of the object itself. After a first recognition stage, the algorithm used in the first prototype maps the distance of the object to the loudness of the signal and the width of the object to its pitch.
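A mapping of distance to loudness and of width to pitch could look roughly like the sketch below. The report states only which quantity drives which sound parameter. The direction of each mapping (nearer is louder, wider is lower), the value ranges and the function name `sonify` are assumptions made for this example.

```python
def sonify(distance_m, width_m,
           max_distance_m=5.0, max_width_m=2.0,
           low_hz=220.0, high_hz=880.0):
    """Map an object's distance and width to (gain, frequency_hz).

    Assumptions (not from the report): nearer objects are louder, with a
    linear gain in [0, 1], and wider objects sound lower, with the pitch
    interpolated between high_hz and low_hz.
    """
    # Clamp the inputs to the assumed working ranges.
    d = min(max(distance_m, 0.0), max_distance_m)
    w = min(max(width_m, 0.0), max_width_m)
    gain = 1.0 - d / max_distance_m
    # Interpolate on a logarithmic frequency scale so that equal width
    # steps correspond to equal musical intervals.
    frequency_hz = high_hz * (low_hz / high_hz) ** (w / max_width_m)
    return gain, frequency_hz
```

For instance, with these assumed defaults, an object 2.5 m away and 1 m wide gives a gain of 0.5 and a frequency of 440 Hz, one octave below the narrowest-object pitch. A logarithmic pitch scale is used because pitch perception is roughly logarithmic in frequency.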

III. In wearable haptic technology
Besides the image-to-audio mapping, some specific cues are reinforced by a haptic belt, composed of 120 Eccentric Rotating Mass motors. These vibration actuators are controlled by a micro-controller, which is supervised by the main processor. This too is an innovative step, where special types of transducers are used for sensory substitution.

A systematic approach was used in this research, which also produced a small device suitable for rendering tactile stimuli, called the vibro-sponge; tactile response accuracy was shown to improve. Results from the two devices (belt and vibro-sponge) have been compared in the literature, and slight deviations were observed, giving rise to new hypotheses.

IV. Training and Testing procedures
It is well known that dedicated, well-designed and coordinated training leads to major performance improvements in the acquisition and usage of any new skill. This aspect of electronic travel aid design has been largely underestimated, or omitted altogether, in earlier approaches to developing technical mobility aids for the blind. The few recent cases that have actually focused more on training than on the technical solution have achieved some of the best results.

The Sound of Vision consortium has devoted a special effort to the development of both training and testing protocols, since the envisioned Sound of Vision system will be an entirely new type of tool for the visually impaired to handle.

V. Virtual Training Environment Design
Sound of Vision regards training as the unique key for users to unlock the full potential of the device and achieve full proficiency. Gamified virtual training is the most important component, providing high availability, cost effectiveness, motivation and safety for the end users.

The training mode of the device will provide a collection of serious games based on virtual environments of gradually increasing complexity. The design of these serious games was a highly innovative and creative activity, and the results – described in detail in deliverable D2.5 – represent a clear progress beyond the state of the art in the areas of serious games as well as of training for/with assistive or rehabilitative solutions.

The main challenges were audio+haptic-only output modalities, users' group particularities and lack of previous research in the area. The Sound of Vision team's key perspective was to keep the training focused on realistic tasks but at the same time highly entertaining.

The detailed design proposes new solutions to provide immersion, motivation, entertainment and user retention/engagement through environment structure, interaction modalities, scenarios, narratives, character development, personalization and multiplayer interactions.

After implementation, the virtual training environments will be open sourced, to facilitate further development by third parties and other applications for the visually impaired in the areas of education and entertainment.

On a broader dimension, this work may open new doors for the rehabilitation, training and entertainment of people with various other categories of impairments or disabilities.

VI. Monitoring stress and other psychological cues during orientation of the blind user
The Sound of Vision team has contributed to advancing knowledge on the interplay between cognitive-affective processes and the urban environmental challenges of mobility (both indoor and outdoor) faced by the visually impaired, through state-of-the-art non-invasive physiological monitoring. The Sound of Vision research focuses on improving the experience of the visually impaired when navigating in unfamiliar environments. In this context, urban mobility challenges for the visually impaired were identified through mobile monitoring of multimodal biosignals. Using the same scheme, stress detection during indoor mobility was also studied. EEG and peripheral biosignals were used together with robust multimodal classification algorithms.

Classical studies rely on quantitative/qualitative methods such as self-reported surveys, matched by EEG and EDA cues. The proposal of a multimodal framework for the assessment of the cognitive-emotional experience of the visually impaired person, based on ambulatory monitoring and multimodal fusion of EEG and EDA signals, is unique. These two sets of data contain complementary information which, when combined, provides the basis for a robust meta-analysis system.

The proposed framework is based on a random forest classifier which successfully infers the correct urban environment in which the visually impaired person is navigating in a real-life experimental scenario. The definition of these environments was based on the current literature in orientation and mobility for visually impaired persons, incorporating the most challenging urban scenarios.

Geolocating the most predictive multimodal features that relate to cognitive load and stress, we provided further insights into the relationship of specific biomarkers with the environmental/situational factors that evoked them. Interestingly, geolocating the most predictive features of stress and cognitive load automatically extracted from the model indicated as stressful the same 'hotspots' in urban environments as those in the self-reported stressful situations experienced by the participants.

The results in the outdoor environment were extended to scenarios in orientation and mobility in unfamiliar indoor environments, again based on ambulatory monitoring and the fusion of multimodal biosignal data, namely EEG and EDA. The attempt resulted in the first study to assess in an automatic way the understanding of stressors that the visually impaired person experiences during a 'real-life' indoor orientation and mobility task. The charted route was designed so as to combine the major mobility challenges which visually impaired persons face when navigating in unfamiliar indoor environments, based on the contemporary literature. Furthermore, the most predictive biomarkers for the automatic understanding of cognitive-emotional experience were pinpointed.

An in-depth comparison of the indoor and outdoor mobility experience of the visually impaired person when navigating in unfamiliar environments was carried out, which improved the understanding of the neural processes which enable spatial perception and cognition when navigating in unfamiliar indoor or outdoor environments.

The findings will hopefully pave the way for emotionally intelligent mobile technologies that take the concept of navigation one step further, accounting not only for the shortest path but also for the most effortless, least stressful and safest one. Overall, the Sound of Vision project has contributed to the advancement of state-of-the-art knowledge in assessing cognitive-emotional experiences for the visually impaired with the automatic real-time prediction of the urban environments in which the visually impaired person is (indoor and outdoor), based on the use of multimodal signals (EEG and EDA) in 'real-life' experimental designs.

Some of the contributions/achievements of this particular field may be listed as follows:
- Affective computing (association of biomarkers and learning algorithms for robust urban scene recognition based on affect and cognition)
- Usability/UX assessment using multimodal (EEG, EDA) biosignals in mobility studies
- First indoor mobility study to assess cognitive-emotional processes of visually impaired persons with multimodal biosignals
- Advances in the prediction rate of the urban scene recognition, both in indoor and outdoor environments, based on biosignals.

Record Number: 194864 / Last updated on: 2017-02-16