Found in Translation – Natural Language Understanding with Cross-Lingual Grounding

Periodic Reporting for period 4 - FoTran (Found in Translation – Natural Language Understanding with Cross-Lingual Grounding)

Reporting period: 2023-03-01 to 2024-03-31

Natural language understanding is the “holy grail” of computational linguistics and a long-term goal in research on artificial intelligence. Understanding human communication is difficult due to the various ambiguities in natural languages and the wide range of contextual dependencies required to resolve them. Discovering the semantics behind language input is necessary for proper interpretation in interactive tools, which requires an abstraction from language-specific forms to language-independent meaning representations.

The goal of the FoTran project is to develop models for natural language understanding trained on the implicit information contained in large collections of human translations. We use massively parallel data sets covering up to a thousand languages to acquire meaning representations that can be used for reasoning with natural languages and for multilingual neural machine translation.

A guiding principle in our research is what we call “cross-lingual grounding”: the resolution of ambiguities through translation. The appeal of this idea is that our machine learning approach uses naturally occurring data, i.e. human translations, instead of artificially created resources and costly manual annotations. The general framework is based on deep learning and neural machine translation, and the hypothesis is that training on increasing amounts of linguistically diverse data improves the abstractions found by the model.

The overall objectives of the project are:
* the development and implementation of models that can learn sentence-level semantics from massively parallel data sets
* the analysis and interpretation of the meaning representations that are learned
* the application of emerging representations as an “interlingua” in machine translation and for advanced reasoning with natural languages

Besides the scientific interest, the project objectives also have a strong societal impact. Language technology already plays a crucial role in human communication in the digital world. Improving the ability to properly understand language input will influence the capabilities of interactive systems and tools. The potential is enormous. Short-term goals include the development of accurate translation services for many more languages with better abstraction, domain knowledge and context-awareness. Long-term goals include the development of complex interactive and intelligent machines with human-like language interfaces and deeper world knowledge.

The FoTran project has contributed to all of the aspects mentioned above. Multilingual tools and resources have been released for public use and further research. Scientific publications shed light on neural language and translation models, and novel modular architectures support sustainable and reusable components.

The FoTran project started with a successful kick-off workshop in September 2018 that attracted over 70 participants and featured international experts in the field of natural language processing. The initial phase was devoted to the development of state-of-the-art neural machine translation (NMT) systems and models for natural language inference (NLI). We proposed our own architecture for what we call “hierarchical refinement encoders” of natural sentences for the NLI task (Talman et al., 2019) and we developed a novel model for multilingual machine translation based on a shared intermediate layer that learns language-agnostic meaning representations (Vázquez et al., 2020). The latter model is called the “attention-bridge model” as it connects independent source language encoders with similarly independent target language decoders via so-called attention links, which summarize semantic information from the input and provide contextualised meaning representations to the output text generator.
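To make the architecture more concrete, the following sketch shows one minimal way such a model could be wired up in PyTorch: language-specific encoders and decoders connected only through a shared attention bridge that compresses each sentence into a fixed number of attention heads. All component choices (GRU layers, sizes, the way the decoder is initialised) are simplifying assumptions for illustration and do not reproduce the published implementation from Vázquez et al. (2020).

```python
# Minimal, illustrative sketch of the attention-bridge idea (assumed details,
# not the published FoTran implementation).
import torch
import torch.nn as nn


class AttentionBridge(nn.Module):
    """Shared layer: summarises variable-length encoder states into a fixed
    number of attention heads, i.e. a language-agnostic sentence matrix."""

    def __init__(self, hidden_dim: int, num_heads: int = 10):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_heads, hidden_dim))

    def forward(self, encoder_states: torch.Tensor) -> torch.Tensor:
        # encoder_states: (batch, src_len, hidden_dim)
        scores = torch.einsum("kh,bsh->bks", self.queries, encoder_states)
        weights = scores.softmax(dim=-1)        # attention over source positions
        return weights @ encoder_states         # (batch, num_heads, hidden_dim)


class MultilingualNMT(nn.Module):
    """Independent encoders/decoders per language, one shared bridge."""

    def __init__(self, langs, vocab_size: int = 32000, hidden_dim: int = 512):
        super().__init__()
        self.src_emb = nn.ModuleDict({l: nn.Embedding(vocab_size, hidden_dim) for l in langs})
        self.tgt_emb = nn.ModuleDict({l: nn.Embedding(vocab_size, hidden_dim) for l in langs})
        self.encoders = nn.ModuleDict({l: nn.GRU(hidden_dim, hidden_dim, batch_first=True) for l in langs})
        self.decoders = nn.ModuleDict({l: nn.GRU(hidden_dim, hidden_dim, batch_first=True) for l in langs})
        self.generators = nn.ModuleDict({l: nn.Linear(hidden_dim, vocab_size) for l in langs})
        self.bridge = AttentionBridge(hidden_dim)   # the only component shared by all pairs

    def forward(self, src_lang, tgt_lang, src_ids, tgt_ids):
        states, _ = self.encoders[src_lang](self.src_emb[src_lang](src_ids))
        bridge_out = self.bridge(states)            # fixed-size "interlingua"
        # Initialise the target decoder from the pooled bridge representation.
        init = bridge_out.mean(dim=1, keepdim=True).transpose(0, 1).contiguous()
        dec_states, _ = self.decoders[tgt_lang](self.tgt_emb[tgt_lang](tgt_ids), init)
        return self.generators[tgt_lang](dec_states)


# Usage: any encoder can be paired with any decoder through the shared bridge,
# which is what makes zero-shot translation directions possible in principle.
model = MultilingualNMT(["en", "de", "fi"])
logits = model("en", "fi",
               src_ids=torch.randint(0, 32000, (2, 7)),
               tgt_ids=torch.randint(0, 32000, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 32000])
```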

We have tested the model in various settings, and a detailed analysis of the model’s behaviour is available in Vázquez et al. (2020), showing that knowledge can be transferred from one language to another, leading to improved translation quality and the possibility to translate between languages without explicit training data. Furthermore, our experiments support the claim that multilingual setups lead to improved abstractions, which becomes visible in semantic probing tasks and in downstream applications that require natural language understanding. We carefully studied the learning dynamics of neural translation models and compared their behaviour with that of language models and with different training objectives.

Another line of research that we emphasise is the interpretation of neural models and the analysis of their generality in terms of cross-domain and cross-task applications. Neural models are non-transparent black boxes and notoriously difficult to understand. We shed some light on the behaviour of neural translation models and published our work on the linguistic interpretation of model patterns in several publications (Raganato and Tiedemann, 2018; Vázquez et al., 2020). Furthermore, we performed important experiments on the lack of generality in state-of-the-art approaches to NLI, showing that such models may fail across domains (Talman and Chatzikyriakidis, 2019).

In the final period we worked on scaling up our modular translation model and looked into the application of multilingual sentence representations in downstream applications. A software framework has been released and widely disseminated (Mickus et al., 2024). We studied the parameters of neural models and how they can be optimised and interpreted. Furthermore, we proposed a methodology for exploring the lexical semantic knowledge encoded in multilingual language models that can be used to build representations of abstract semantic properties with relevance for downstream applications (Garí Soler and Apidianaki, 2021).
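As a rough illustration of what probing for such properties can look like in practice, the sketch below extracts contextual embeddings from a multilingual masked language model and fits a simple linear probe for a placeholder semantic property. The model name, the toy data and the probed property are assumptions for exposition; this is not the methodology of Garí Soler and Apidianaki (2021).

```python
# Illustrative probing sketch (assumed data and property, not the cited method).
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.linear_model import LogisticRegression

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

# Hypothetical probing data: sentences paired with a binary semantic label,
# e.g. low vs. high intensity of the highlighted adjective.
sentences = ["The water was lukewarm.", "The water was scalding."]
labels = [0, 1]


def embed(sentence: str) -> torch.Tensor:
    """Mean-pool the last hidden layer into one sentence vector."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)


features = torch.stack([embed(s) for s in sentences]).numpy()

# A linear probe: if it separates the labels, the property is (linearly)
# recoverable from the multilingual representations.
probe = LogisticRegression(max_iter=1000).fit(features, labels)
print("train accuracy:", probe.score(features, labels))
```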

We concluded the project with two international events: a closing symposium with invited speakers organised in Helsinki, and an international workshop on modular and open multilingual NLP (MOOMIN) co-located with EACL 2024.
Our modular toolkit MAMMOTH implements a novel, scalable framework that can easily be extended with additional languages, which supports our goal of dramatically increasing the language coverage of multilingual translation models. It supports various architectures with partially shared components, including an “attention-bridge model” that implements a cross-lingual bottleneck to learn language-agnostic meaning representations that can directly be applied to other downstream tasks. The architecture provides a flexible framework that can further be augmented with multimodal data such as spoken language or even images. This enables grounded semantic representations that combine evidence coming from various languages and their connection to audio-visual features. Furthermore, our efforts in explainability and interpretation of neural models increase the ability to understand black-box systems and the linguistic information they pick up from raw data. The work opens many directions for future research, including the development of efficient and scalable architectures for multilingual language and translation modelling.
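To illustrate how such a language-agnostic bottleneck can feed a downstream task, the sketch below reuses the bridge from the earlier toy MultilingualNMT model as a cross-lingual sentence encoder with a small classification head on top. The class and parameter names are again illustrative assumptions and do not correspond to the MAMMOTH API.

```python
# Sketch: reuse the shared bridge of the toy MultilingualNMT model above as a
# cross-lingual sentence encoder for a downstream classifier (assumed setup).
import torch
import torch.nn as nn


class BridgeClassifier(nn.Module):
    """Pool the (num_heads x hidden_dim) bridge matrix and classify."""

    def __init__(self, nmt_model: nn.Module, src_lang: str, num_classes: int,
                 hidden_dim: int = 512):
        super().__init__()
        self.embed = nmt_model.src_emb[src_lang]
        self.encoder = nmt_model.encoders[src_lang]
        self.bridge = nmt_model.bridge               # shared, language-agnostic layer
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, src_ids: torch.Tensor) -> torch.Tensor:
        states, _ = self.encoder(self.embed(src_ids))
        sentence_vec = self.bridge(states).mean(dim=1)   # (batch, hidden_dim)
        return self.head(sentence_vec)


# A classifier trained on data from one language can, in principle, be applied
# to any other language that shares the same bridge by swapping in that
# language's encoder.
```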
Figures: project hypothesis, project logotype, project setup