Periodic Reporting for period 2 - UTTER (Unified Transcription and Translation for Extended Reality)
Reporting period: 2024-04-01 to 2025-09-30
While we already have the ability to gather people instantly and cheaply from all over the world, in UTTER we will develop technologies that provide the next level of support for online and hybrid meetings and offer an eXtended Reality (XR) experience. We will provide the type of meeting support that would normally only be possible with a personal assistant, such as summarising the meeting and answering questions about it. Our XR models will also be applied to online customer support, where we will help an organisation provide global customer support through transcription and translation. We will also develop models for personalised support in the customer's own language, for quality control, and for assisting human agents with clarity, empathy, and cultural awareness.
All of these abilities can now be considered because of a revolution in the representational power of deep learning models: transformer architectures trained with self-supervision on vast amounts of unlabelled data (models such as BERT and GPT-3) are able to encode various levels of linguistic, semantic, and factual information, and are potentially adaptable to multiple end tasks. These large pre-trained XR models are soon to become a commodity, yet in their current form they pose serious concerns: they are English-centric and mostly designed to deal with text only; their adaptation to diverse tasks is brittle, leading to frequent hallucinations, non-factual output, biases, and leakage of private information; they are rather opaque, making decisions by means that are not amenable to human inspection; they are fragile to noise and other forms of adverse input, and often fail with high confidence; they are not open to the community; their training is not reproducible; and they are highly inefficient (to train and to use).
UTTER takes these models to the next level, by making them multimodal (supporting text and speech), multilingual, adaptable, safe, controllable, robust, and more efficient to train and to deploy.
We have contributed novel multilingual data for XR model pretraining, as well as novel XR models. Our contributions to data creation span both speech and text modalities: in speech, mHuBERT-147 (147 languages) and Speech-MASSIVE (12 languages); in text, ELITR-bench, MAIA, PMIndiaSum, and other datasets addressing machine translation, language identification, and multilingual natural language understanding. In terms of models, we built mHuBERT-147 (for speech), TowerLM (for text translation tasks), and Spire (an extension of Tower to the speech modality). In RP2 we released the EuroLLM suite, three multilingual foundation models of sizes 1.7B, 9B, and 22B, trained from scratch through a EuroHPC extreme-scale grant. They support all 24 official EU languages plus 11 additional languages and achieve state-of-the-art performance on various multilingual benchmarks. We also built two task-specific efficient models: the Multilingual DistilWhisper for automatic speech recognition, and an approach for efficient CTC regularization for speech translation.
Obj. 2: Development of new methods and algorithms for adapted, contextualised, and robust dialogue assistant XR models.
We have contributed datasets, methodology, software, and empirical observations to advance various aspects of adaptation, contextualisation, uncertainty-awareness, explainability, and robustness of XR models. Our contributions span 160 publications and associated open-source repositories, and have won 5 awards at research conferences. To accelerate development in these topics, we co-organised various international events (6 shared tasks and 1 workshop).
Obj. 3: Development of tools for online meetings and customer support agents.
We developed a customer support assistant prototype incorporating technology developed in UTTER (the Tower+ and xCOMET models), including machine translation, quality estimation, grammatical error correction, cultural appropriateness detection and adaptation, and emotion recognition of customer messages.
We developed a meeting assistant prototype designed to test long-context large language models (LLMs) in realistic settings. In the first year, we built a general-purpose, LLM-powered assistant for friendly, informal meeting interactions. The second-year version added robustness to ambiguity, noise, and edge cases. The third year led to a trustworthy-by-design assistant built on NAV’s Trust Mediator (TM) framework, incorporating input filtering, safeguards, and compliance checks, which are core components for building accountable AI systems. We have demonstrated these prototypes at user days, and they have undergone evaluation.
Obj. 4: Sustainable, maintainable platform and services.
We released TowerEval, an open-source LLM evaluation repository and toolkit for several different text-based tasks, ranging from translation to grammatical error correction. Models and datasets developed in Obj. 1 have been released with open weights in the Hugging Face ecosystem and downloaded over 1.7M times so far. Unbabel’s Widn.ai translation service was built by leveraging the Tower models developed in UTTER and combining them with proprietary technology and resources (Figure 3). The Naver trustworthy-by-design assistant, which builds on the Trust Mediator (TM) framework, a key contribution to Obj. 3, also supports this objective.