aRTIFICIAL iNTELLIGENCE for the Deaf

Periodic Reporting for period 1 - aiD (aRTIFICIAL iNTELLIGENCE for the Deaf)

Reporting period: 2019-12-01 to 2021-11-30

aiD pursues cross-disciplinary breakthrough innovation that builds on and extends the latest academic research advances. Its overarching goal is to offer a comprehensive suite of solutions catering to the communication needs of deaf people. aiD has the following objectives: 1) Create multi-modal, high-accuracy deep learning models for interactive communication between deaf and hearing people. 2) Develop deep network compression techniques for scaling the developed models to mobile devices. 3) Implement application pilots of immense practical importance that will be evaluated by end-users. 4) Promote global-scale excellence in science by supporting the Consortium partners to become leading experts in their field and transfer their acquired expertise to research institutions, companies and the public.

Our pilots address realistic operating environments that allow for end-user engagement and a solid exploitation plan: (i) an augmented reality (AR) news service; (ii) an Automated Relay Service for emergencies; and (iii) an Interactive Digital Tutor application. We pursue development on multiple technological frontiers: signal processing, signal perception and generation via advanced machine learning (ML), creation of virtual sign language (SL) signals in an AR environment, and scalability of the developed technologies to commodity mobile devices, which are accessible to the vast majority of potential users.

To achieve these objectives, aiD implements an inter-sectorial secondment program for Experienced Researchers (ERs) and Early Stage Researchers (ESRs) that fosters knowledge exchange between academic experts and industrial leaders in cutting-edge technological fields. The social significance and cutting-edge nature of our endeavors have already attracted the strong interest and commitment of a number of important supporting institutions. These will provide: i) a large end-user base; ii) deep knowledge of the needs of deaf people; and iii) vast dissemination and exploitation avenues.

Work package 2
Task 2.1 – Business and technical requirements specification (Completed)
Deliverable D2.1 describes the specification, requirements, and reference architecture in detail, and serves as the foundation for subsequent work.

Task 2.2 – Data management plan (Completed)
The corresponding deliverable D2.2 (Data Management Report), led by CUT, was submitted on time. This deliverable provides the necessary guidelines for implementing Open Science policies and data management.

Task 2.3 – Reference Architecture (Completed)
The submitted deliverable D2.1 describes the aiD components and architectures in detail, along with the modelling pipelines and model integration used by the aiD project.

Work package 4
Task 4.1 - Dataset curation (Completed)
In the context of this task, Deliverable D4.1 was submitted. We created, collected, and curated the largest open annotated SL footage dataset in the literature. This dataset is publicly available on Zenodo.org.

Task 4.2 - SL-to-text algorithm (Ongoing)
We developed a Sequence-to-Sequence model for text generation from SL (SLT). The results were published at the top-tier ICCV conference. This publication presents not only the most advanced and accurate SLT method reported to date, but also an integrated compression technique that achieved 70% compression without any loss in accuracy. This method will be at the core of the final contribution of WP4, which will be documented in D4.2.

Task 4.3 - Text-to-speech algorithm (Completed)
We started the development phase of the text-to-speech (TTS) task by implementing and exploring the performance of the well-known Tacotron-2, obtaining models with significantly reduced training and inference times.

Work package 5
Task 5.1 - Dataset curation (Ongoing)
Deliverable D5.1, "Curated dataset for learning to generate avatars depicting SL", is almost complete and ready to be submitted ahead of schedule. Its outcome will represent the largest dataset for training deep generative models for Text-to-SL (SLG).

Task 5.2 - Speech-to-text algorithm (Completed)
The aiD consortium has developed new libraries both for speech-to-text and text-to-speech tasks, extending upon state-of-the-art deep neural networks, including Jasper.

Task 5.3 - Text-to-trajectories algorithm (M5, M38) (Ongoing)
We are currently developing prototypes for text-to-trajectories, based on recent advances in state-of-the-art Transformer generative models, and specifically the Progressive Transformer.

Task 5.4 – AR environment (Ongoing)
The researchers involved are currently experimenting with several technical solutions for the creation of the AR environment.

Work package 6
Task 6.1 – Bayesian Inference Mechanisms (Ongoing)
The currently developed techniques achieve a compression rate of more than 70% for the considered networks, without any loss in the corresponding accuracy metric.
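
To make the reported figure concrete, the following is a minimal sketch of how a compression rate could be computed. It assumes that "compression rate" means the fraction of network parameters removed, which this report does not state explicitly; the function name and the parameter counts are hypothetical and are not taken from the aiD codebase.

```python
def compression_rate(original_params: int, remaining_params: int) -> float:
    """Return the fraction of parameters removed from a network.

    Assumption: compression rate = 1 - remaining / original.
    The report does not define the metric; this is one common reading.
    """
    if original_params <= 0:
        raise ValueError("original_params must be positive")
    if not 0 <= remaining_params <= original_params:
        raise ValueError("remaining_params must lie in [0, original_params]")
    return 1.0 - remaining_params / original_params


# Hypothetical example: a network with 1,000,000 weights
# that keeps 300,000 of them after compression.
rate = compression_rate(1_000_000, 300_000)
print(f"compression rate: {rate:.0%}")
```

Under this reading, a rate above 70% means that fewer than 30% of the original parameters remain in the compressed network.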

Task 6.2 – Network Distillation approaches (Ongoing)
CUT has commenced a thorough review of the existing literature.

Work package 7 – Pilots and Evaluation
There is active and open discussion and collaboration among the researchers working in WP7, WP4 and WP5. All methodology developments have been disseminated promptly and accurately to pilot development and incorporated into the pilots.

aiD aims to address the challenge of communication and social integration for deaf people by leveraging the latest advances in ML, HCI and AR. Specifically, speech-to-text/text-to-speech algorithms have reached high performance as a product of the latest breakthrough advances in the field of deep learning (DL). However, the commercially available systems cannot be readily integrated into a solution targeted at communication between deaf and hearing people. On the other hand, existing research efforts to tackle the problem of transcribing SL video or generating synthetic SL footage (SL avatar) from text have failed to generate a satisfactory outcome.

aiD addresses both of these problems. We develop speech-to-text/text-to-speech modules tailored to the requirements of a system that supports communication with deaf people. Most importantly, we systematically address the core technological challenge of SL transcription and generation in an AR environment. Our vision is to exploit and advance the state-of-the-art in DL to solve these problems with groundbreaking accuracy, in a fashion amenable to commodity mobile hardware. This will be in stark contrast to existing systems, which either depend on sophisticated, costly equipment (multiple vision sensors, gloves, and wristbands) or are lab-only systems limited to fingerspelling as opposed to the official SL that deaf people actually use. Indeed, the current state-of-the-art requires expensive devices and operates on a word-by-word basis, thus missing the syntactic context. Finally, these solutions are not amenable to commodity mobile devices.

Our vision is to resolve these inadequacies so as to offer a concrete solution for real-time interaction between deaf and hearing people. Our core innovation lies in the development of new algorithms and techniques that enable the real-time translation of SL video to text or speech and vice versa (SL avatar generation from speech/text in an AR environment), with satisfactory accuracy, in a fashion amenable to commodity mobile devices such as smartphones and tablets.