aRTIFICIAL iNTELLIGENCE for the Deaf

Periodic Reporting for period 2 - aiD (aRTIFICIAL iNTELLIGENCE for the Deaf)

Periodo di rendicontazione: 2021-12-01 al 2023-11-30

aiD pursues cross-disciplinary breakthrough innovation that builds and extends upon the latest academic research advances. Its overarching goal is to offer a comprehensive suite of solutions catering to deaf people communication needs. aiD has the following objectives: 1) Create multi-modal, high-accuracy deep learning models for deaf/hearing people interactive communication. 2) Develop deep network compression techniques for scaling the developed models to mobile devices. 3) Implement application pilots of immense practical importance that will be evaluated by the end-users. 4) Promote global-scale excellence in science by supporting the Consortium partners to become leading experts in their field and transfer their acquired expertise to research institutions, companies and the public.

Our considered pilots deal with realistic operating environments that allow for end-user engagement and a solid exploitation plan: (i) an AR news service; (ii) an Automated Relay Service for emergencies; and (iii) an Interactive Digital Tutor application. We pursue technological development on multiple technological frontiers: signal processing, signal perception and generation via advanced ML, creation of virtual SL signals in an AR environment, and scalability of the developed technologies on commodity mobile devices, accessible to the vast majority of potential users.

To achieve these objectives, aiD implements an inter-sectorial secondment program for Experienced Researchers (ERs) and Early Stage Researchers (ESRs) that fosters knowledge exchange between academic experts and industrial leaders in bleeding-edge technological fields. The social significance and bleeding-edge nature of our endeavors has resulted in the aiD consortium having already attracted the strong interest and commitment of a number of important supporting institutions. These will provide: i) a large end-user base; ii) deep knowledge of deaf people needs; and iii) vast dissemination and exploitation avenues.

Work package 2
Task 2.1 – Business and technical requirements specification (Completed)
Deliverable D2.1 offers a very detailed and solid description of the specification, requirements, and reference architecture, and is used as a foundation for future work.

Task 2.2 – Data management plan (Completed)
The corresponding deliverable D2.2 – Data Management Report led by CUT has been submitted on time.

Task 2.3 – Reference Architecture (Completed)
The submitted deliverable D2.1 describes aiD components/architectures in detail, along with the modelling pipelines and model integration that is used by the aiD project.

Work package 4
Task 4.1 - Dataset curation (Completed)
In the context of this task, Deliverable D4.1 was submitted. We created/collected and curated the largest open annotated SL footage in the literature. This dataset is publicly available on Zenodo.org.

Task 4.2 - SL-to-text algorithm. (Completed)
We developed a Transformer model for text generation from SL (SLT), published in the top-tier ICCV conference. The model was subsequently fine-tuned by considering trajectory feature extraction mechanisms such as AlphaPose and OpenPose; the final work was published in ICCV 2023. The final contribution of the Task is contained in D4.2 and used in the pilots.

Task 4.3 - Text-to-speech algorithm (Completed)
We started the development phase of the text-to-speech (TTS) task by implementing and exploring the performance of the well-known Tacotron-2. This is part of D4.2.

Work package 5
Task 5.1 - Dataset curation (Completed)
Deliverable D5.1 – Curated dataset for learning to generate realistically looking videos of SL translation, by means of generative AI. The outcome represents the largest dataset for training deep generative models for Text-to-SL video (SLG).

Task 5.2 - Speech-to-text algorithm (Completed)
The aiD consortium has developed new libraries both for speech-to-text and text-to-speech tasks, extending upon state-of-the-art deep neural networks, including Jasper. This is part of D5.2.

Task 5.3 & Task 5.4 - Text-to-SL video (Completed)
We initially experimented with Transformer generative models, mainly the Progressive Transformer, for learning to generate trajectories from text; this would be fed to an expensive and complicate graphics engine. We observed an outcome of disturbingly low accuracy, which was commensurate with the best methods reported in the literature.
Drawing from these results, we took the bold step of considering bleeding-edge denoising diffusion probabilistic models (DDPMs) for text-to-video. We fine-tuned on the SL video generation task, and obtained a groundbreaking outcome, which is both more efficient and accurate than any existing alternative (publication pending; part of D5.2).

Work package 6
Task 6.1 – Bayesian Inference Mechanisms (Completed)
The developed techniques achieve a compression rate of more than 70% for the SLT model.

Task 6.2 – Network Distillation approaches (Completed)
We experiments with distillation on DDPMs to yield scalable models for SLG.

Work package 7 – Pilots and Evaluation (Completed)
We had a very active and open discussion and collaboration among the researchers that worked in WP7, WP4 and WP5. All methodology developments have been timely and accurately disseminated to pilot development and incorporated into pilots.

aiD addressed the challenge of deaf people communication and social integration by leveraging the latest advances in generative AI.

Specifically, speech-to-text/text-to-speech algorithms have currently reached high performance. However, the commercially available systems cannot be readily integrated into a solution targeted to the communication between deaf and hearing people. aiD addressed this integration challenge.

On the other hand, previous research efforts to tackle the problem of transcribing SL video were fraught with (i) limited scope, as they could address only specific datasets and had limited generalization capacity outside the training dataset; (ii) accuracy that did not allow for a highly usable outcome from the end-user perspective; and (iii) high computational costs for inference, which limited scalability. We addressed these issues via novel Transformer networks with (i) LWTA layers for higher accuracy, (ii) variational inference-driven compression with virtually no accuracy loss, and (iii) novel feature extraction principles based on AlphaPose that generalize across different datasets.

Further, we focused our innovation impetus on generating synthetic SL footage from text; existing approaches have failed to generate any meaningful outcome outside very limited scenarios. We performed deep ablations and found two main reasons for this lacklustre performance: (i) the use of Transformers, which excel in discrete data, to generate continuous trajectories and (ii) the reliance on a coordination between deep networks and graphics, which is hard due to the asynchronous nature of Transformers. We took the bold step of leveraging bleeding-edge denoising diffusion probabilistic models (DDPMs) for text-to-video, in an effort to resolve all these issues. We yielded a groundbreaking, few-shot text-to-SL video method that requires low training costs, outperforms the accuracy of existing approaches by a large margin (quantitively and qualitatively), and can perform inference on a simple PC, by leveraging distillation.

aiD logo

Periodic Reporting for period 2 - aiD (aRTIFICIAL iNTELLIGENCE for the Deaf)

Condividi questa pagina

Scarica