Hands-free Voice-enabled Interface to Web Applications for Smart Home Environments

Periodic Reporting for period 2 - LISTEN (Hands-free Voice-enabled Interface to Web Applications for Smart Home Environments)

Reporting period: 2017-06-01 to 2019-05-31

In this project, our goal has been to design a complete voice-enabled interface (software and hardware) for Internet applications in the smart home environment, both for controlling specific automations of the smart home and for providing access to the web for specific tasks, such as email, social networking platforms, calendar, etc. A clear distinction from currently popular systems for voice-based Internet access is the Wireless Acoustic Sensor Network (WASN) approach, which supports a truly seamless operation of the voice interface. The proposed design was based on the distributed operation of several acoustic sensors that localise the speakers, enhance the speech capture for their particular locations, and transmit the enhanced speech signal to the speech transcription engine. LISTEN’s sensor network features low-cost hardware and stand-alone real-time operation thanks to a custom MEMS (micro-electro-mechanical systems) microphone design. The acoustic front-end was jointly developed and optimised with the automatic speech recognition (ASR) system. A central motivation behind forming the LISTEN consortium and its objectives was to bring together a much-needed interdisciplinary team and research plan, so as to achieve substantial progress beyond the state of the art in hands-free ASR systems.

The objectives of the LISTEN project have been:
• Objective 1: To develop a large-vocabulary speech recognition system for the smart home, for multiple domains (e.g. command/control of the smart home functionalities, web access, web search, email / message dictation, calendar and to-do lists, social networks, etc.), and for multiple languages. Focus has been on English, Greek, Italian, and German to demonstrate the system’s easy adaptation to further languages.
• Objective 2: To develop a front-end for robust speech capture in a distributed fashion, performing real-time localisation of multiple speakers and enhancing the speech capture for these specific locations accordingly. The front-end design includes the hardware design of a low-cost distributed acoustic sensor network, where each sensor is equipped with an on-board processor so that signal localisation and enhancement can be performed in a collaborative manner.
• Objective 3: To create a prototype and evaluate it in practical situations, including in the Ambient Intelligence Facility of the coordinator (FORTH). To widely disseminate the project results and maximise the potential societal and commercial impact of our activities and the developed platform.
The project work concentrated on two pillars, one being the work on wireless acoustic sensor networks (WASNs) for voice enhancement regardless of the speaker’s location and orientation in the smart home, the second pillar being a large-vocabulary speech recognition system for web-based services.

LISTEN achieved significant progress and remarkable innovations along the lines of the planned research. On the one hand, we proposed WASN-based sound localization and directional speech enhancement, as well as an innovative WASN sensor design. We designed, studied, and implemented a system of low-cost stand-alone sensors that performs voice enhancement in real time. The novelty of this system is evidenced by several top-quality research publications. The “heart” of the system is the estimation of the direction of arrival (DoA) of multiple sound signals at each microphone array sensor, which turns exact sound localization into a “bearing-only” estimation problem. For multiple concurrent sound sources this is a difficult problem, and our approach has been one of very few in this area, demonstrated not only by relevant publications but also by real-time demonstrations. In addition, we examined DoA estimation using spherical microphone arrays, which provide information in 3D space (vs. the 2D information to which planar arrays are restricted).
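To illustrate what a “bearing-only” formulation means, the following is a minimal sketch, not the LISTEN implementation. It assumes a single source in 2D, known sensor positions, and DoA angles already expressed in a common global frame; the function name `localise_from_bearings` and the angle convention are hypothetical. Each bearing defines a line through its sensor, and the source position is taken as the least-squares intersection of those lines. The harder multi-source case mentioned above additionally requires deciding which bearing belongs to which source, which this sketch does not attempt.

```python
import math


def localise_from_bearings(sensors, bearings):
    """Least-squares intersection of bearing lines in 2D.

    sensors  : list of (x, y) sensor positions (assumed known)
    bearings : list of DoA angles in radians, measured counter-clockwise
               from the global x-axis (assumed convention)
    """
    if len(sensors) != len(bearings) or len(sensors) < 2:
        raise ValueError("need at least two sensors, one bearing each")

    # A bearing t at sensor p defines a line with unit normal
    # n = (-sin t, cos t); any point x on it satisfies n . x = n . p.
    # Accumulate the 2x2 normal equations (sum n n^T) x = sum n (n . p).
    a11 = a12 = a22 = b1 = b2 = 0.0
    for (px, py), t in zip(sensors, bearings):
        nx, ny = -math.sin(t), math.cos(t)
        d = nx * px + ny * py
        a11 += nx * nx
        a12 += nx * ny
        a22 += ny * ny
        b1 += nx * d
        b2 += ny * d

    det = a11 * a22 - a12 * a12
    if abs(det) < 1e-12:
        raise ValueError("bearings are (nearly) parallel; position not observable")

    # Solve the symmetric 2x2 system by Cramer's rule.
    x = (a22 * b1 - a12 * b2) / det
    y = (a11 * b2 - a12 * b1) / det
    return x, y


# Usage with three hypothetical sensors and noise-free bearings to a source.
source = (2.0, 1.0)
sensors = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
bearings = [math.atan2(source[1] - py, source[0] - px) for px, py in sensors]
estimate = localise_from_bearings(sensors, bearings)
```

The sketch treats each bearing as an infinite line rather than a ray, so it ignores which side of the sensor the source lies on. With noisy DoA estimates, additional sensors simply add rows to the least-squares problem.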

Similarly, the large-vocabulary speech recognition system optimized for the smart home and web-based services was implemented in the four languages of interest (English, Greek, German, Italian). The LISTEN consortium developed this solution as a fully local (on-device) implementation, which offers the advantages of scalability (i.e. a WASN solution with tens of sensors per home, as envisioned by the consortium) and privacy for the users (no audio is sent outside the home). Additionally, joint work on acoustic sensor networks and speech recognition was performed, and the innovation achieved is evidenced by the 2nd place our technology received in the widely acknowledged CHiME speech recognition “challenge”.

Overall, the technological objectives of LISTEN were clearly achieved. Progress in terms of project management, dissemination, and exploitation of results is evidenced by the submission of 2 US patents, 30 top-quality scientific publications, 2 international workshops organized by the consortium’s academic partners, radio interviews with LISTEN staff, newspaper reports on the project, school visits to partners’ premises, etc.
The expected potential impact of the research performed in LISTEN is multi-dimensional. The project aimed to address several issues affecting the competitiveness of Europe in acoustic sensor networks, smart home environments and voice-based interfaces, which relate to many important aspects of society, such as assisted living, monitoring and surveillance, work and entertainment. Thus, the project contributes to an improved quality of life in multiple dimensions. Our research was performed at the same time as several far-field voice interfaces (such as the Amazon Echo) were introduced to the market, with consumer acceptance that clearly indicates the strong potential of voice-based interfaces in the Internet of Things (IoT) era. Before these devices, one could only argue for the importance of this area in terms of economic impact. Nowadays, the introduction of these devices has made it common to read statements on the web such as that voice interfaces for the smart home have been “the billion dollar business no-one saw coming”. Needless to say, the LISTEN consortium did see this opportunity coming, and the project in fact focused on what we believe is the next step for these interfaces, i.e. seamless operation and ease of deployment in the smart home. Having developed several exploitable research results within the project, the consortium has a tremendous opportunity to gain direct economic impact by licensing its technologies or by targeting the consumer market directly. The opportunities for the project partners to continue their collaboration beyond the LISTEN project are constantly being evaluated by the consortium. At the same time, the training that our students and employees received in this area of technology through LISTEN has given them a strong background for their career development and opens new career perspectives.
In turn, the development of these skills, and the portfolio of innovation being built by LISTEN, can contribute to European competitiveness in a market (currently) dominated by US corporations.