
Unified Transcription and Translation for Extended Reality

Periodic Reporting for period 1 - UTTER (Unified Transcription and Translation for Extended Reality)

Reporting period: 2022-10-01 to 2024-03-31

Throughout the COVID-19 pandemic, society endured a series of lockdowns. Although these were extremely difficult to manage, many people were able to continue working and studying online, largely thanks to the availability of online video-conferencing tools, which provide video, audio, screen-sharing, and chat capabilities. Though in-person meetings are possible again, the ease of meeting online persists, pointing to a future where hybrid formats will become the new norm. Hybrid meetings combine the best aspects of both formats: greater global accessibility, time savings, and substantial reductions in transport costs and carbon emissions. But they also bring important challenges and new opportunities: How can we offer a good experience to both in-person and online participants? How can we enable global collaborations in which people speak their own language, understand each other, and are understood?

So whilst we can already gather people instantly and cheaply from all over the world, in UTTER we will develop technologies that provide the next level of support for online and hybrid meetings, offering an eXtended Reality (XR) experience. We will provide the type of meeting support that would previously have required a personal assistant, such as summarising the meeting and answering questions about it. Our XR models will also be applied to online customer support, where we will help organisations provide global customer support through transcription and translation. We will furthermore develop models for personalised support in the customer's own language, quality control, and assisting human agents with clarity, empathy, and cultural awareness.

All of these abilities can now be considered because of a revolution in the representational power of deep learning models: transformer architectures trained with self-supervision on vast amounts of unlabelled data – models such as BERT and GPT-3 – encode various levels of linguistic, semantic, and factual information and are potentially adaptable to many end tasks. These large pre-trained XR models are soon to become a commodity, yet in their current form they pose serious concerns: they are English-centric and mostly designed to deal with text only; their adaptation to diverse tasks is brittle, leading to frequent hallucinations, non-factual output, biases, and leakage of private information; they are rather opaque, making decisions by means that are not amenable to human inspection; they are fragile to noise and other forms of adverse input, and often fail with high confidence; and they are not open to the community, their training is not reproducible, and they are highly inefficient to train and to use.

UTTER takes these models to the next level, by making them multimodal (supporting text and speech), multilingual, adaptable, safe, controllable, robust, and more efficient to train and to deploy.
In the first reporting period, the UTTER consortium advanced as planned along its main objectives. We describe the work done towards them in more detail below.

Objective 1: Advancing research in responsible creation and application of multilingual, multimodal pretrained XR models.

We have contributed novel multilingual data for XR model pretraining, as well as novel XR models. Our contributions to data creation span both the speech and text modalities. In speech, we have collected extensive multilingual datasets such as mHuBERT-147 (147 languages) and Speech-MASSIVE (12 languages). In text, our contributions include ELITR-Bench for the meeting assistant, MAIA for the customer-care assistant, PMIndiaSum for summarisation, and other datasets addressing machine translation, language identification, and multilingual natural language understanding. In terms of models, we built two foundation models, mHuBERT-147 (for speech) and TowerLM (for text), as well as two task-specific efficient models: Multilingual DistilWhisper for automatic speech recognition and an efficient CTC-regularisation approach for speech translation. All of our contributions are open source (code and data); TowerLM, for example, was downloaded over 15,000 times within two months of its release.
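Both foundation models are distributed through open repositories. For orientation only, the following is a minimal sketch (not project code) of how such pretrained checkpoints might be loaded with the Hugging Face transformers library; the repository identifiers and the choice of a 7B instruction-tuned TowerLM checkpoint are assumptions and may differ from the official releases.

    # Sketch: loading a speech and a text foundation model from the Hugging Face Hub.
    # The model identifiers below are assumptions, not confirmed project releases.
    import numpy as np
    import torch
    from transformers import (AutoFeatureExtractor, AutoModel,
                              AutoModelForCausalLM, AutoTokenizer)

    # Speech foundation model (assumed identifier for mHuBERT-147).
    speech_id = "utter-project/mHuBERT-147"
    feature_extractor = AutoFeatureExtractor.from_pretrained(speech_id)
    speech_model = AutoModel.from_pretrained(speech_id)

    # Encode one second of 16 kHz audio (a dummy waveform here) into hidden states.
    waveform = np.zeros(16000, dtype=np.float32)
    inputs = feature_extractor(waveform, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        speech_states = speech_model(**inputs).last_hidden_state

    # Text foundation model (assumed identifier for a TowerLM release).
    text_id = "Unbabel/TowerInstruct-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(text_id)
    text_model = AutoModelForCausalLM.from_pretrained(text_id, torch_dtype=torch.bfloat16)

    prompt = "Translate the following sentence into French:\nHello, how can I help you?\n"
    input_ids = tokenizer(prompt, return_tensors="pt")
    output = text_model.generate(**input_ids, max_new_tokens=40)
    print(tokenizer.decode(output[0], skip_special_tokens=True))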

Objective 2: Development of new methods and algorithms for adapted, contextualised, and robust dialogue assistant XR models.

We have contributed datasets, methodology, software, and empirical observations that advance adaptation, contextualisation, uncertainty awareness, explainability, and robustness of XR models. Our contributions span 49 open-access publications and over 30 repositories of code and data (which had gathered over 500 stars on GitHub at the time of reporting). To accelerate progress on these topics, we co-organised several international events (6 shared tasks and 1 workshop).

Objective 3: Development of tools for online meetings and customer support agents.

We have developed two modular prototypes to support our two use cases (earlier versions of which we demonstrated at our First User Day; a video demo is available on our website and on YouTube). One prototype is a customer-service assistant that supports a bilingual text-based conversation between a client and a customer-service expert. Besides performing translation (and hence enabling a bilingual conversation), the assistant can perform quality estimation and sentiment analysis, and it makes recommendations that help the two human participants observe each other's cultural norms. The other prototype is an online meeting assistant that offers interactive summarisation features through a question-answering (chat-based) API. The current version works on transcripts of an online meeting and can be used to retrieve and infer information about a meeting with multiple online participants.
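To make the customer-service flow concrete, here is a minimal sketch (not the UTTER prototype itself) built from publicly available components: an off-the-shelf translation model, a reference-free quality-estimation checkpoint from the CometKiwi family, and a generic sentiment classifier. All model identifiers are assumptions standing in for the project's own components.

    # Sketch of a bilingual customer-support turn: translate the client's message,
    # estimate translation quality without a reference, and attach a sentiment signal.
    from transformers import pipeline
    from comet import download_model, load_from_checkpoint  # pip install unbabel-comet

    # German -> English translation for the bilingual conversation (assumed model).
    translator = pipeline("translation", model="Helsinki-NLP/opus-mt-de-en")

    # Reference-free quality estimation (assumed CometKiwi checkpoint, gated on the Hub).
    qe_model = load_from_checkpoint(download_model("Unbabel/wmt22-cometkiwi-da"))

    # Sentiment analysis on the translated message (default English classifier).
    sentiment = pipeline("sentiment-analysis")

    client_message = "Mein Paket ist immer noch nicht angekommen."
    translated = translator(client_message)[0]["translation_text"]

    qe = qe_model.predict([{"src": client_message, "mt": translated}],
                          batch_size=1, gpus=0)

    print("Agent sees:", translated)
    print("QE score:", qe.scores[0])               # flag low-confidence translations
    print("Sentiment:", sentiment(translated)[0])  # e.g. a frustrated customer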

Objective 4: Sustainable, maintainable platform and services.

This objective mostly concerns tasks planned for the second reporting period, but we have already made some initial contributions. For example, we developed TowerEval, an evaluation repository and toolkit that can be used for several text-based tasks, ranging from translation to grammatical error correction. TowerEval is open source, available on GitHub (12 stars at the time of reporting) and through the Hugging Face Hub.
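As a rough illustration of the kind of scoring such a toolkit automates (shown here with sacreBLEU rather than TowerEval's own interface, which we do not reproduce):

    # Generic translation evaluation: score system outputs against references.
    from sacrebleu.metrics import BLEU, CHRF

    hypotheses = ["The package has not arrived yet."]
    references = [["The parcel still has not arrived."]]

    bleu, chrf = BLEU(), CHRF()
    print(bleu.corpus_score(hypotheses, references))   # corpus-level BLEU
    print(chrf.corpus_score(hypotheses, references))   # corpus-level chrF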
Since the start, UTTER has consistently pushed the research state of the art, publishing best-in-class approaches to cross-lingual and multilingual tasks such as machine translation, translation evaluation, translation quality estimation, grammatical error correction, automatic post-editing, and low-resource speech translation, through models such as TowerLM, xCOMET, CometKiwi23, and Naver's entry to the IWSLT'23 shared task. We also published the first dataset for evaluating long-context LLMs in the specific use case of a meeting assistant (ELITR-Bench). As UTTER is at a relatively low TRL, the research state of the art is the most relevant measure of our work in RP1.
Photo: UTTER's group photo from the first in-person meeting in Amsterdam.