Periodic Reporting for period 3 - FoTran (Found in Translation – Natural Language Understanding with Cross-Lingual Grounding)
Reporting period: 2021-09-01 to 2023-02-28
The goal of the project is to develop models for natural language understanding trained on the implicit information provided by large collections of human translations. We use massively parallel data sets covering up to a thousand languages to acquire meaning representations that can be used for reasoning with natural languages and for multilingual neural machine translation.
A guiding principle in our research is what we call “cross-lingual grounding”: the resolution of ambiguities through translation. The beauty of this idea is that our machine learning approach uses naturally occurring data, i.e. human translations, instead of artificially created resources and costly manual annotations. The general framework is based on deep learning and neural machine translation, and the hypothesis is that training on increasing amounts of linguistically diverse data improves the abstractions found by the model.
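The disambiguation effect behind cross-lingual grounding can be illustrated with a toy example: an ambiguous English word such as “bank” receives distinct German translations (“Bank” for the financial institution, “Ufer” for the riverside), so the translation itself acts as a naturally occurring sense label. A minimal sketch, using a hypothetical hand-made word-aligned corpus in place of real alignments from a parallel collection:

```python
from collections import Counter, defaultdict

# Hypothetical word-aligned English-German pairs; in practice these
# alignments would be extracted from a large parallel corpus.
aligned_pairs = [
    ("bank", "Bank"),    # financial sense
    ("bank", "Bank"),
    ("bank", "Ufer"),    # riverside sense
    ("plant", "Pflanze"),
    ("plant", "Fabrik"),
    ("dog", "Hund"),
]

def translation_distributions(pairs):
    """Count how often each source word is translated by each target word.
    A source word with several frequent translations is likely ambiguous,
    and its translations serve as free sense annotations."""
    dists = defaultdict(Counter)
    for src, tgt in pairs:
        dists[src][tgt] += 1
    return dists

dists = translation_distributions(aligned_pairs)
print(dists["bank"])      # Counter({'Bank': 2, 'Ufer': 1}) -> two senses observed
print(len(dists["dog"]))  # 1 -> unambiguous under this toy corpus
```

The same counting idea scales to many language pairs, where each additional language contributes further evidence for separating senses.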
The overall objectives of the project are:
* the development and implementation of models that can learn sentence-level semantics from massively parallel data sets
* the analysis and interpretation of the learned meaning representations
* the application of emerging representations as an “interlingua” in machine translation and for advanced reasoning with natural languages
Besides their scientific interest, the project objectives also have a strong societal impact. Language technology already plays a crucial role in human communication in the digital world. Improving the ability to properly understand language input will influence the capabilities of interactive systems and tools. The potential is enormous. Short-term goals include the development of accurate translation services for many more languages with better abstraction, domain knowledge and context-awareness. Long-term goals include the development of complex interactive and intelligent machines with human-like language interfaces and deeper world knowledge.
We have tested our multilingual translation model in various settings; a detailed analysis of its behaviour is available in Vázquez et al. (2020), showing that knowledge can be transferred from one language to another, leading to improved translation quality and the ability to translate between language pairs without explicit training data. Furthermore, our experiments support the claim that multilingual setups lead to improved abstractions, which becomes visible in semantic probing tasks and in downstream applications that require natural language understanding. We also carefully studied the learning dynamics of neural translation models and compared their behaviour with that of language models trained with different objectives.
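The project's models use more elaborate shared architectures (Vázquez et al., 2020), but the standard mechanism that enables translation between language pairs never seen together in training (“zero-shot” translation) can be sketched with the widely used target-language-token trick: a single shared model is conditioned on a tag naming the desired output language. The sentences below are illustrative placeholders, not project data:

```python
# Minimal sketch of target-language-token conditioning for multilingual NMT.
# One shared model trained on tagged examples can be asked for an unseen
# direction at inference time simply by switching the tag.

def tag_example(src_sentence: str, tgt_lang: str) -> str:
    """Prepend a target-language token so a single shared model learns to
    produce output in the requested language."""
    return f">>{tgt_lang}<< {src_sentence}"

# Toy training directions: en->de and de->fi (no en->fi pairs observed).
training_data = [
    ("A cat sleeps.", "de", "Eine Katze schläft."),
    ("Eine Katze schläft.", "fi", "Kissa nukkuu."),
]
tagged = [(tag_example(src, tgt_lang), tgt) for src, tgt_lang, tgt in training_data]

# Zero-shot request: ask for Finnish output from English input, a direction
# the toy training data never covered.
zero_shot_input = tag_example("A cat sleeps.", "fi")
print(zero_shot_input)  # >>fi<< A cat sleeps.
```

Whether the shared representations actually support such unseen directions is exactly the kind of question our analyses of cross-lingual transfer address.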
Another line of research that we emphasise is the interpretation of neural models and the analysis of their generality in cross-domain and cross-task applications. Neural models are non-transparent black boxes and notoriously difficult to understand. We shed some light on the behaviour of neural translation models and published our work on the linguistic interpretation of model patterns in several publications (Raganato and Tiedemann, 2018; Vázquez et al., 2020). Furthermore, we performed important experiments on the lack of generality in state-of-the-art approaches to natural language inference (NLI), showing that such models may fail across domains (Talman and Chatzikyriakidis, 2019).
Currently, we are scaling up our modular translation model and investigating the application of multilingual sentence representations in downstream tasks. We study the parameters of neural models and how they can be optimised and interpreted. Recently, we proposed a methodology for exploring the lexical semantic knowledge encoded in multilingual language models, which can be used to build representations of abstract semantic properties relevant for downstream applications (Garí Soler and Apidianaki, 2021), and we will continue to investigate the linguistic features of contextualised language representations.
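One simple signal that contextualised vectors encode degrees of polysemy is self-similarity: the average pairwise cosine similarity of a word's vectors across different sentence contexts. The sketch below uses synthetic stand-in vectors rather than real BERT embeddings, and is only a schematic illustration of the measurement idea, not the actual methodology of Garí Soler and Apidianaki (2021):

```python
import numpy as np

def self_similarity(vectors: np.ndarray) -> float:
    """Average pairwise cosine similarity of one word's contextualised
    vectors across sentences. Lower self-similarity suggests the contexts
    pull the word in different semantic directions, i.e. more polysemy."""
    normed = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = len(vectors)
    # Average over distinct pairs only (exclude the diagonal of ones).
    return float((sims.sum() - n) / (n * (n - 1)))

rng = np.random.default_rng(0)
# Toy stand-ins for contextual embeddings: a monosemous word's vectors
# cluster around one direction, a polysemous word's around two.
base = rng.normal(size=16)
mono = np.stack([base + 0.1 * rng.normal(size=16) for _ in range(6)])
sense_a, sense_b = rng.normal(size=16), rng.normal(size=16)
poly = np.stack([(sense_a if i % 2 else sense_b) + 0.1 * rng.normal(size=16)
                 for i in range(6)])

print(self_similarity(mono) > self_similarity(poly))  # True
```

With real embeddings, the vectors would be taken from the contextual layers of a model such as multilingual BERT, one vector per occurrence of the target word.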
Alessandro Raganato and Jörg Tiedemann (2018). An Analysis of Encoder Representations in Transformer-Based Machine Translation. In Proceedings of BlackboxNLP. DOI: 10.18653/v1/W18-5431
Aarne Talman and Stergios Chatzikyriakidis (2019). Testing the Generalization Power of Neural Network Models Across NLI Benchmarks. In Proceedings of BlackboxNLP, pp. 85–94.
Aarne Talman, Anssi Yli-Jyrä and Jörg Tiedemann (2019). Sentence Embeddings in NLI with Iterative Refinement Encoders. Natural Language Engineering, 25(4), pp. 467–482.
Raúl Vázquez, Alessandro Raganato, Mathias Creutz and Jörg Tiedemann (2020). A Systematic Study of Inner-Attention-Based Sentence Representations in Multilingual Neural Machine Translation. Computational Linguistics, 46(2), pp. 387–424.
Aina Garí Soler and Marianna Apidianaki (2021). Let's Play Mono-Poly: BERT Can Reveal Words' Degree of Polysemy and Partitionability into Senses. Transactions of the Association for Computational Linguistics (TACL).