CORDIS - EU research results

Unified Transcription and Translation for Extended Reality

Periodic Reporting for period 2 - UTTER (Unified Transcription and Translation for Extended Reality)

Reporting period: 2024-04-01 to 2025-09-30

Throughout the COVID-19 pandemic, society endured a series of lockdowns. Although this was extremely difficult to manage, many people were able to continue working and studying online, largely thanks to the availability of online video conferencing tools, which provide video, audio, screen sharing, and chat capabilities. Though in-person meetings are possible again, the ease of meeting online persists, and hybrid working is a new norm. This brings together the best aspects of both formats, with an increase in global accessibility, time savings, and substantial reductions in transport costs and carbon emissions. But it also comes with important challenges and new opportunities: How can we ensure a good experience for both in-person and online participants? How can we promote global collaborations where people can speak their own language, understand each other and be understood?

While we can already gather people instantly and cheaply from all over the world, in UTTER we will develop technologies that provide the next level of support for online and hybrid meetings and offer an eXtended Reality (XR) experience. We will provide the type of meeting support that would normally only be possible with a personal assistant, such as summarising the meeting and answering questions about it. Our XR models will also be applied to online customer support, where we will help an organisation provide global customer support through transcription and translation. Furthermore, we will develop models for personalised support in the customer's own language, for quality control, and for assisting human agents with clarity, empathy, and cultural awareness.

All of these abilities can now be considered because of a revolution in the representational power of deep learning models: transformer architectures trained with self-supervision on vast amounts of unlabelled data (models such as BERT and GPT-3) are able to encode various levels of linguistic, semantic, and factual information, and are potentially adaptable to multiple end tasks. These large pre-trained XR models are soon to become a commodity, yet in their current form they pose serious concerns: they are English-centric and mostly designed to deal with text only; their adaptation to diverse tasks is brittle, leading to frequent hallucinations, non-factual output, biases, and leakage of private information; they are rather opaque, making decisions by means that are not amenable to human inspection; they are fragile to noise and other forms of adverse input, and often fail with high confidence; they are not open to the community, their training is not reproducible, and they are highly inefficient (to train and to use).

UTTER takes these models to the next level, by making them multimodal (supporting text and speech), multilingual, adaptable, safe, controllable, robust, and more efficient to train and to deploy.

Obj. 1: Advancing research in responsible creation and application of multilingual, multimodal pretrained XR models.

We have contributed novel multilingual data for XR model pretraining, as well as novel XR models. Our contributions to data creation span both speech and text modalities: in speech, mHuBERT-147 (147 languages) and Speech-MASSIVE (12 languages); in text, ELITR-Bench, MAIA, PMIndiaSum, and other datasets addressing machine translation, language identification, and multilingual natural language understanding. In terms of models, we built mHuBERT-147 (for speech), TowerLM (for text translation tasks), and Spire (an extension of Tower to the speech modality). In reporting period 2 (RP2) we released the EuroLLM suite of models: three multilingual foundation models (1.7B, 9B, and 22B parameters) trained from scratch through a EuroHPC extreme-scale grant. They support all 24 official EU languages plus 11 additional languages and achieve state-of-the-art performance on various multilingual benchmarks. We also built two task-specific efficient models: the Multilingual DistilWhisper for automatic speech recognition, and an approach for efficient CTC regularization for speech translation.

Obj. 2: Development of new methods and algorithms for adapted, contextualised, and robust dialogue assistant XR models.

We have contributed datasets, methodology, software, and empirical observations to advance various aspects of adaptation, contextualisation, uncertainty-awareness, explainability, and robustness of XR models. Our contributions span 160 publications and associated open-source repositories, and have won 5 awards at research conferences. To accelerate development in these topics, we co-organised various international events (6 shared tasks and 1 workshop).

Obj. 3: Development of tools for online meetings and customer support agents.

We developed a customer support assistant prototype incorporating technology developed in UTTER (the Tower+ and xCOMET models), including machine translation, quality estimation, grammatical error correction, cultural appropriateness detection and adaptation, and emotion recognition of customer messages.
We also developed a meeting assistant prototype to test long-context large language models (LLMs) in realistic settings. In the first year, we built a general-purpose, LLM-powered assistant for friendly, informal meeting interactions. The second-year version added robustness to ambiguity, noise, and edge cases. The third year led to a trustworthy-by-design assistant built on NAV’s Trust Mediator (TM) framework, incorporating input filtering, safeguards, and compliance checks, which are core components for building accountable AI systems. We have demonstrated these prototypes at user days, and they have undergone evaluation.

Obj. 4: Sustainable, maintainable platform and services.

We released TowerEval, an open-source LLM evaluation repository and toolkit for several different text-based tasks, ranging from translation to grammatical error correction. Models and datasets developed in Obj. 1 have been released with open weights in the Hugging Face ecosystem and downloaded over 1.7M times so far. Unbabel’s Widn.ai translation service was built by leveraging the Tower models developed in UTTER and combining them with proprietary technology and resources (Figure 3). The Naver trustworthy-by-design assistant, which builds on the Trust Mediator (TM) framework, a key contribution to Obj. 3, also supports this objective.

Since the start, UTTER has consistently pushed the state of the art in research by publishing best-in-class approaches to cross-lingual and multilingual tasks such as machine translation, translation evaluation, translation quality estimation, grammatical error correction, automatic post-editing, and low-resource speech translation, through models such as TowerLM, xCOMET, CometKiwi23, and mHuBERT-147. Specifically, our models have won the chat translation shared task at WMT 2024, the biomedical domain shared task at WMT 2024, the quality estimation shared task at WMT 2023, the instruction-following short track task at IWSLT 2025, and the low-resource speech translation task at IWSLT 2023. We also published the first dataset for evaluating long-context LLMs in the specific use case of a meeting assistant (ELITR-Bench). As UTTER is at a relatively low TRL, research state of the art is the most relevant metric for our work.
UTTER's first in-person meeting in Amsterdam
UTTER's group photo: first in-person meeting in Amsterdam