Periodic Reporting for period 4 - FoTran (Found in Translation – Natural Language Understanding with Cross-Lingual Grounding)
Reporting period: 2023-03-01 to 2024-03-31
The goal of the FoTran project is to develop models for natural language understanding trained on the implicit information given by large collections of human translations. We apply massively parallel data sets covering up to a thousand languages to acquire meaning representations that can be used for reasoning with natural languages and for multilingual neural machine translation.
A guiding principle in our research is what we call “cross-lingual grounding”: the effect of resolving ambiguities through translation. The appeal of this idea is that our machine learning approach uses naturally occurring data, i.e. human translations, instead of artificially created resources and costly manual annotations. The general framework is based on deep learning and neural machine translation, and the hypothesis is that training on increasing amounts of linguistically diverse data improves the abstractions found by the model.
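As a minimal sketch of how translation resolves ambiguity, the example below runs two English sentences containing the ambiguous word “bank” through a publicly released Helsinki-NLP OPUS-MT English-German model from the Hugging Face hub; the model choice and the sentences are illustrative assumptions rather than the project’s internal multilingual system.

```python
# Minimal sketch (assumption: the public Helsinki-NLP/opus-mt-en-de checkpoint,
# not the project's internal system) showing how translation forces a choice
# between the two senses of English "bank".
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-de"
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

sentences = [
    "She deposited the money at the bank.",          # financial sense, typically "Bank"
    "They had a picnic on the bank of the river.",   # riverside sense, typically "Ufer"
]
batch = tokenizer(sentences, return_tensors="pt", padding=True)
outputs = model.generate(**batch)
for src, tgt in zip(sentences, tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"{src} -> {tgt}")
```

The same kind of disambiguation signal, observed at scale over many language pairs, is what the project exploits as implicit supervision.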
The overall objectives of the project are:
* the development and implementation of models that can learn sentence-level semantics from massively parallel data sets
* the analysis and interpretation of the learned meaning representations
* the application of emerging representations as an “interlingua” in machine translation and for advanced reasoning with natural languages
Besides the scientific interest, the project objectives also have a strong societal impact. Language technology already plays a crucial role in human communication in the digital world. Improving the ability to properly understand language input will influence the capabilities of interactive systems and tools. The potential is enormous. Short-term goals include the development of accurate translation services for many more languages with better abstraction, domain knowledge and context-awareness. Long-term goals include the development of complex interactive and intelligent machines with human-like language interfaces and deeper world knowledge.
The FoTran project has contributed to all of the aspects mentioned above. Multilingual tools and resources have been released for public use and further research. Scientific publications shed light on neural language and translation models, and novel modular architectures support sustainable and reusable components.
We have tested the model in various settings; a detailed analysis of its behaviour is available in Vázquez et al. (2020), showing that knowledge can be transferred from one language to another, leading to improved translation quality and the ability to translate between language pairs without explicit training data. Furthermore, our experiments support the claim that multilingual setups lead to improved abstractions, which becomes visible in semantic probing tasks and in downstream applications that require natural language understanding. We also carefully studied the learning dynamics of neural translation models and compared their behaviour with that of language models and different training objectives.
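The semantic probing protocol mentioned above can be sketched as follows: a lightweight classifier is trained on frozen sentence representations to predict a semantic property, and its accuracy indicates how well that property is encoded. The sketch below uses scikit-learn with randomly generated placeholder features; in an actual experiment the features would be encoder states from the translation model and the labels would come from a probing data set.

```python
# Probing sketch (assumptions: placeholder features and labels; in practice
# X holds frozen sentence representations from the encoder under study and
# y holds annotations from a semantic probing task).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_sentences, dim = 2000, 512
X = rng.normal(size=(n_sentences, dim))     # stand-in for frozen encoder states
y = rng.integers(0, 2, size=n_sentences)    # stand-in for semantic labels

probe = LogisticRegression(max_iter=1000)
scores = cross_val_score(probe, X, y, cv=5)
# Higher accuracy suggests the property is linearly recoverable from the states;
# comparing scores across encoders indicates which setup learns better abstractions.
print(f"probing accuracy: {scores.mean():.3f}")
```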
Another line of research that we emphasise is the interpretation of neural models and the analysis of their generality in cross-domain and cross-task applications. Neural models are non-transparent black boxes and notoriously difficult to understand. We shed some light on the behaviour of neural translation models and published our work on the linguistic interpretation of model patterns in several publications (Raganato and Tiedemann, 2018; Vázquez et al., 2020). Furthermore, we performed important experiments on the lack of generality in state-of-the-art approaches to natural language inference (NLI), showing that such models may fail across domains (Talman and Chatzikyriakidis, 2019).
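The cross-domain evaluation idea can be illustrated with a few lines of code: a trained NLI model is applied to premise-hypothesis pairs from domains it was not trained on, and its accuracy is compared across domains. The sketch below uses the public roberta-large-mnli checkpoint as a stand-in for the models studied; the checkpoint and the example pairs are assumptions for illustration.

```python
# Cross-domain NLI sketch (assumption: roberta-large-mnli as a stand-in model).
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

name = "roberta-large-mnli"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForSequenceClassification.from_pretrained(name)
model.eval()

def predict(premise: str, hypothesis: str) -> str:
    """Return the predicted NLI label for a premise-hypothesis pair."""
    inputs = tok(premise, hypothesis, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return model.config.id2label[logits.argmax(dim=-1).item()]

# An in-domain style pair versus an out-of-domain (e.g. clinical) style pair;
# aggregating accuracy per domain exposes the generality gap discussed above.
print(predict("A man is playing a guitar on stage.",
              "A person is making music."))
print(predict("The patient was administered 5 mg of the drug twice daily.",
              "The patient received medication."))
```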
In the final period we worked on scaling up our modular translation model and on applying multilingual sentence representations in downstream tasks. A software framework has been released and widely disseminated (Mickus et al., 2024). We studied the parameters of neural models and how they can be optimised and interpreted. Furthermore, we proposed a methodology for exploring the lexical semantic knowledge encoded in multilingual language models, which can be used to build representations of abstract semantic properties with relevance for downstream applications (Garí Soler and Apidianaki, 2021).
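To give a flavour of this kind of analysis, the sketch below extracts contextual representations of a target word from a multilingual encoder and compares them across contexts; the choice of bert-base-multilingual-cased, the subword pooling and the toy sentences are illustrative assumptions and not the protocol of Garí Soler and Apidianaki (2021).

```python
# Sketch of extracting contextual word representations from a multilingual
# encoder (assumptions: model choice, mean pooling over subwords, toy sentences).
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "bert-base-multilingual-cased"
tok = AutoTokenizer.from_pretrained(MODEL)
enc = AutoModel.from_pretrained(MODEL)
enc.eval()

def word_vector(sentence: str, word: str) -> torch.Tensor:
    """Mean-pool the hidden states of the subwords belonging to `word`."""
    inputs = tok(sentence, return_tensors="pt")
    with torch.no_grad():
        states = enc(**inputs).last_hidden_state[0]          # (seq_len, dim)
    word_ids = tok(word, add_special_tokens=False)["input_ids"]
    ids = inputs["input_ids"][0].tolist()
    # Naive search for the word's subword span (assumes a single occurrence).
    for i in range(len(ids) - len(word_ids) + 1):
        if ids[i:i + len(word_ids)] == word_ids:
            return states[i:i + len(word_ids)].mean(dim=0)
    raise ValueError(f"{word!r} not found in tokenised sentence")

# Toy comparison of scalar adjectives in the same context.
v_warm = word_vector("The soup was warm when it arrived.", "warm")
v_hot = word_vector("The soup was hot when it arrived.", "hot")
v_cold = word_vector("The soup was cold when it arrived.", "cold")
cos = torch.nn.functional.cosine_similarity
print("warm~hot :", cos(v_warm, v_hot, dim=0).item())
print("warm~cold:", cos(v_warm, v_cold, dim=0).item())
```

Aggregating such in-context vectors over many sentences and languages is one way to build the representations of abstract semantic properties referred to above.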
We concluded the project with two international events: a closing symposium with invited speakers organised in Helsinki, and an international workshop on modular and open multilingual NLP (MOOMIN) co-located with EACL 2024.