Multimodal context and voice recognition for seamless voice control technology interfaces with low upfront cost

Project information

XMOS

Grant agreement ID: 849469

DOI

10.3030/849469

Closed project

EC signature date 28 May 2019

Start date 1 January 2019

End date 31 March 2021

Funded under

INDUSTRIAL LEADERSHIP - Innovation In SMEs

Total cost

€ 2 983 500,00

EU contribution

€ 2 088 450,00

Coordinated by

XMOS LIMITED
United Kingdom

Periodic Reporting for period 2 - XMOS (Multimodal context and voice recognition for seamless voice control technology interfaces with low upfront cost)

Reporting period: 2020-01-01 to 2021-03-31

Society is entering an era of intelligent connectivity. Technology providers are designing these capabilities into more and more products for the Internet of Things (IoT) as consumers become comfortable with smart devices.

With potentially hundreds of these devices in our homes, cars, cities – everywhere we go – we need new, easier and more natural ways of interacting with them, freeing our hands from keyboards and control panels. We will want our devices to know when to interact with us and, just as importantly, when we require privacy. Our devices will need to detect our presence, distinguish us from others and interpret multiple data streams to 'understand' the environment in which they are operating.

Today, voice-enabled devices, such as smart speakers, use a costly "applications processor" to perform communications and control processing in conjunction with cloud-based servers. However, when applied to the IoT, this approach has significant disadvantages, the most important being:

- Security & privacy: sensitive personal information transported and stored externally
- Energy consumption: processing raw data in the cloud consumes significant energy
- Cost: limits the range of viable applications.

These factors limit the adoption of novel human-machine interfaces in a vast range of IoT applications. For example, a voice interface, a key feature that can vastly improve user experience, is currently the domain of large, expensive processors.

The XMOS project has developed a set of core technologies for low-cost voice interfaces and sensor aggregation systems. These will enable next generation human-machine interfaces, allowing manufacturers to rapidly incorporate voice interfaces into a wide range of products. Incorporating AI inferencing techniques, the device can process information locally, reducing privacy and bandwidth issues and allowing manufacturers to add differentiating features.

XMOS has designed a low-cost, cloud-connected, "Thin-client" AIoT design, comprising a far-field voice interface, Artificial Intelligence inferencing and user customisation. This design can be produced as a stand-alone module or integrated into a complex system. The architecture is flexible, enabling developers to add new capabilities and AI models to support a wide range of use cases.

The Thin-client's central component is xcore.ai, a cross-over processor that integrates most of the digital functionality required at a low cost. The XMOS AIoT module demonstrates this in a very compact voice interface design (55 mm x 30 mm) including microphones, AI inferencing, and Wi-Fi.

Incorporating voice control into an existing product design was one of the primary use cases for this project, so removing the need for an engineer to learn new programming environments was a critical factor in gaining wider customer acceptance. XMOS developed a framework that enables machine learning engineers to develop AI models using their preferred tools.

XMOS investigated three areas of embedded AI voice processing: speaker localisation, identification and keyword detection. In all three areas, the core technology has been prototyped, demonstrated to be viable, and the algorithms and models made available for exploitation. Customer trials indicated most interest in local, customised keyword detection as this offers brand differentiation without a dependency on the tech giants’ voice service platforms.
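As an illustration of local keyword detection, the sketch below smooths per-frame keyword scores with a moving average and fires when the average crosses a threshold. This is a generic technique, not the XMOS algorithm, which the report does not describe. The function name `detect_keyword`, the window length and the threshold are assumptions chosen for the example, and the scores are presumed to come from some on-device model.

```python
from collections import deque


def detect_keyword(scores, window=5, threshold=0.8):
    """Return the index of the first frame at which the moving average of
    the last `window` scores reaches `threshold`, or None if it never does.

    `scores` is an iterable of per-frame keyword probabilities in [0, 1].
    All names and default values here are illustrative assumptions.
    """
    recent = deque(maxlen=window)  # holds only the most recent frames
    for index, score in enumerate(scores):
        recent.append(score)
        # Only decide once a full window of frames has been seen.
        if len(recent) == window and sum(recent) / window >= threshold:
            return index
    return None
```

Averaging over a window means a single spurious high score does not trigger the device, at the cost of a short detection delay.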

XMOS also investigated short-range radar to add complementary functionality to the voice interface as this allows a system to detect human presence, especially people's movement, without the privacy implications of capturing visual images.

Using a 60 GHz radar chip, XMOS demonstrated detection and classification of people in a room, and the AI tools could identify them as one of a group of pre-registered individuals. This enables new applications where an appliance can adapt automatically for each user, e.g. a cooker could disable functions depending on whether a child or adult is present.
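The cooker example can be sketched as a small policy function that maps the set of identified people to the functions an appliance enables. Everything here is hypothetical: the `Profile` type, the function names and the rule that any adult present unlocks all functions are assumptions made for illustration, not details from the project.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    """A pre-registered individual (hypothetical structure)."""
    name: str
    is_adult: bool


# Illustrative function sets for a cooker; not taken from the report.
ALL_FUNCTIONS = {"hob", "oven", "timer", "light"}
CHILD_SAFE_FUNCTIONS = {"timer", "light"}


def allowed_functions(present):
    """Return the functions to enable for the profiles currently detected.

    Assumed policy: everything is enabled if at least one adult is present;
    otherwise (children only, or nobody identified) only safe functions.
    """
    if any(profile.is_adult for profile in present):
        return set(ALL_FUNCTIONS)
    return set(CHILD_SAFE_FUNCTIONS)
```

Defaulting to the restricted set when nobody is identified is a conservative design choice for this sketch.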

While radar offers novel capabilities, other use cases require the higher resolution that cameras provide. Internet camera systems typically send images via the internet to cloud-based AI inferencing systems. Processing these images and running inferencing at the edge, i.e. without image data leaving the sensor, significantly improves performance and helps meet privacy and data protection requirements.

XMOS created an AIoT module with a camera that uses the XMOS AI tools to validate a live image against a pre-registered photograph. No captured image data is stored or transmitted, protecting privacy, and the module can take action locally (e.g. turning on an appliance only when the owner is present).
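One common way to validate a live image against a registered one is to compare feature vectors (embeddings) produced by a model, keeping only the vectors and never the images. The report does not say how the XMOS module does this, so the sketch below is an assumption throughout: the embeddings are presumed to come from some on-device model, and the names `cosine_similarity` and `owner_present` and the threshold of 0.9 are illustrative.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def owner_present(live_embedding, registered_embedding, threshold=0.9):
    """Decide locally whether the live embedding matches the registered one.

    Only the two embedding vectors are compared; no image is stored or sent.
    The threshold is an assumed value that a real system would need to tune.
    """
    return cosine_similarity(live_embedding, registered_embedding) >= threshold
```

The boolean result could then gate a local action, such as enabling an appliance, without any data leaving the device.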

The XMOS project executed a diverse set of trials to validate the product/market fit of these designs. The feedback from users on the technical and market readiness for this research was very valuable. A clear market priority that emerged is for low-cost voice interfaces with embedded keyword detection, which can be designed into products today. Multi-sensor applications and edge AI capabilities are still in the early stages of market acceptance; there is much interest, but it will be a few years before mass adoption.

The main outcome of this project was the development of a far-field voice interface with AI-based sensor processing at a much lower cost than any previous solution. These capabilities, coupled with a set of flexible input/output and networking interfaces, have resulted in an AIoT module that enables a wide range of applications.

XMOS has launched two commercial products and filed four patent applications as a result of this project. Several customers have incorporated parts of the output in products that will be in volume production in 2021. XMOS is also partnering with Microsoft to add AI voice capabilities to the recently announced Microsoft Azure Percept IoT platform.

XMOS engaged with the TinyML industry community and contributed to the TensorFlow Lite, FreeRTOS and Clang projects. XMOS partnered with Plumerai to develop binarized neural network capabilities. These actions, combined with thought leadership activities, have consolidated the position of XMOS as a specialist in the fields of voice processing and AIoT, broadening global awareness of the XMOS brand.

After completing this project, XMOS intends to further exploit the AIoT design. The go-to-market plan is to release an AIoT reference design and software tools to customers in 2021, with the goal of deriving the majority of the company’s revenue from the project deliverables from 2022 onwards.

XMOS has been able to develop a versatile, scalable, cost-effective and robust set of designs that manufacturers can use to build smarter, human-aware products that can facilitate higher quality human-machine interactions and make life simpler, safer, and more satisfying for all.

Potential customers testing the XVF3510 live at IFA.

XMOS stand at Voice Summit in Newark.

Prototype Radar device for detecting human presence.

TinyML Summit 2020 XMOS Poster session

XMOS Thin-client module

Low-cost voice client (inside the white line in centre)

Amazon qualified 3510 demonstration and evaluation system.

Thin-client module incorporating radar
