Multi-modal Context Modelling for Machine Translation

Periodic Reporting for period 3 - MultiMT (Multi-modal Context Modelling for Machine Translation)

Reporting period: 2018-10-01 to 2020-03-31

Automatically translating human language has been a long sought-after goal in the field of Natural Language Processing (NLP). Machine Translation (MT) can significantly lower communication barriers, with enormous potential for positive social and economic impact. The dominant paradigm is Neural Machine Translation (NMT), which learns to translate from human-translated examples. Human translators have access to a number of contextual cues beyond the actual segment to translate when performing translation, for example images associated with the text and related documents. NMT systems, however, completely disregard any form of non-textual context and make little or no reference to wider surrounding textual content. This results in translations that miss relevant information or convey incorrect meaning. Such issues drastically affect reading comprehension and may make translations useless. This is especially critical for user-generated content such as social media posts – which are often short and contain non-standard language – but applies to a wide range of text types. The goal of MultiMT is to devise methods and algorithms to exploit global multi-modal information for context modelling in NMT. The project has focused on new ways to acquire multilingual multi-modal representations, and
new machine learning and inference algorithms that can process rich context models. Thus far, the focus has been mainly on visual cues from images and on metadata such as topic and speaker. Multimodal and multilingual datasets with images have been created, a new area of research has been established, and novel approaches to exploit this information have been devised. These approaches lie at the intersection of natural language processing, computer vision and machine learning, and have led to significant improvements in translation quality, showing that additional modalities do contribute to better language understanding and generation.
The first half of the project focused mainly on visual information and on addressing two types of critical problems in translation: mistranslation (i.e. incorrect translations due to ambiguities or vagueness in language) and unknown words (i.e. words that have not been seen by the model and cannot be translated). The results are as follows:

1) Three datasets with images and their textual descriptions: (i) Multi30K, a manually translated dataset of 31K image descriptions in English with their translations into German, French and Czech; (ii) How2, a semi-automatically translated dataset of instructional videos from English into Portuguese with 200K segment pairs; and (iii) MultiSubs, an automatically collected dataset of 1-4 million sentence pairs for four language pairs with corresponding images;
2) Novel approaches to multimodal neural machine translation and image captioning, including approaches using inferred metadata, such as domain, and acoustic cues; and
3) New evaluation metrics: for multimodal machine translation, focusing on ambiguous words; and for image captioning and multimodal machine translation, focusing on detecting descriptions/translations that are consistent with the content of the image (i.e. its objects).
We proposed the following novel approaches to advance the state of the art in multimodal machine translation, image captioning and their evaluation:

1) Multimodal machine translation: these include statistical approaches that combine captioning and translation models, where the output of captioning models is used to reinforce the statistics of candidate translations (Lala et al., 2017); approaches to re-rank the output of statistical models (Shah et al., 2016) or neural models according to how well they disambiguate ambiguous words, as judged by multimodal word sense translation models (Lala et al., 2018); approaches that learn structured visual information, which is then used to condition the neural text decoder (Madhyastha et al., 2017) or given to the model to jointly learn to translate and to align regions in images with words in the source sentence (Specia et al., 2019-toAppear); neural approaches using acoustic cues representing domain and speaker information (Deena et al., 2017a, 2017b); and approaches using deliberation neural networks to improve translation through a second decoder enhanced with visual information (Ive et al., 2019-toAppear);
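The re-ranking idea above can be illustrated with a minimal sketch: an n-best list from a translation model is re-scored by interpolating the original model score with a score from a (here entirely hypothetical) multimodal word-sense model that prefers the reading supported by the image. All scores, sentences and the interpolation weight below are toy values for illustration, not the actual models of Lala et al. (2018).

```python
# Toy sketch of n-best re-ranking with a multimodal word-sense score.
# All data and weights are hypothetical illustrations.

def rerank(nbest, wsd_score, weight=0.5):
    """Interpolate the MT model score with a visual word-sense score
    and sort candidates by the combined score (higher is better)."""
    rescored = [
        (weight * mt_score + (1 - weight) * wsd_score(hyp), hyp)
        for hyp, mt_score in nbest
    ]
    return [hyp for _, hyp in sorted(rescored, reverse=True)]

# Hypothetical example: English "bank" is ambiguous; the image shows a
# river, so the word-sense model prefers the "riverbank" reading (Ufer).
nbest = [("Die Bank am Fluss", 0.6), ("Das Ufer am Fluss", 0.5)]
wsd = lambda hyp: 0.9 if "Ufer" in hyp else 0.1

print(rerank(nbest, wsd))  # the visually grounded reading moves to the top
```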

2) Image captioning: this task is closely related to machine translation, except that the language generation component is conditioned on the image alone rather than on the image and a source text. We addressed it to exploit datasets available for a single language, which are much larger than those available for machine translation. Our novel approach uses structured object information (the presence of objects, their frequency and position) rather than the abstract, dense representations that are common practice in the field (Wang et al., 2018);
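The contrast with dense representations can be sketched as follows: instead of an opaque CNN feature vector, the image is encoded as an explicit, interpretable vector of per-category counts plus coarse spatial occupancy. The category list, detection format and grid size are hypothetical simplifications, not the actual features of Wang et al. (2018).

```python
# Sketch of "structured visual information" for captioning: which objects
# are present, how often, and roughly where. All data are hypothetical.

CATEGORIES = ["person", "dog", "ball"]

def object_features(detections, grid=2):
    """Map detections [(category, (x, y))], with coordinates in [0, 1],
    to per-category counts plus coarse grid-cell occupancy."""
    counts = {c: 0 for c in CATEGORIES}
    occupancy = {c: [0] * (grid * grid) for c in CATEGORIES}
    for cat, (x, y) in detections:
        counts[cat] += 1
        cell = min(int(y * grid), grid - 1) * grid + min(int(x * grid), grid - 1)
        occupancy[cat][cell] = 1
    # Flatten into one interpretable vector a text decoder could condition on.
    return [counts[c] for c in CATEGORIES] + sum((occupancy[c] for c in CATEGORIES), [])

dets = [("person", (0.2, 0.3)), ("dog", (0.8, 0.7)), ("person", (0.6, 0.2))]
print(object_features(dets))  # → [2, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
```

Each position in the resulting vector has a fixed meaning (e.g. "number of dogs", "a person in the top-left cell"), which is what makes this representation structured rather than abstract.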

3) Multimodal lexical translation: instead of translating the whole sentence, lexical translation focuses on ambiguous words, which are arguably harder to translate with text-only models. The approaches we propose treat the problem as tagging or classification with recurrent neural networks, where the visual information is incorporated through attention mechanisms. The input to such models is a word in its context together with the corresponding image (or the image region corresponding to the ambiguous word), and the output is the translation of the word in the target language (Lala et al., 2019-toAppear);
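A toy version of the classification view: each candidate translation of an ambiguous word is scored by how strongly the detected objects (a crude stand-in for visual attention) and the textual context support it, and the best-scoring candidate is chosen. The sense-to-object table, the weighting and all examples are hypothetical illustrations, not the recurrent attention models of Lala et al. (2019).

```python
# Toy sketch of multimodal lexical translation as classification.
# The cue table and weights are hypothetical illustrations.

SENSE_CUES = {            # cues that support each translation of "seal"
    "Robbe": ["animal", "water"],    # the animal
    "Siegel": ["letter", "wax"],     # the stamp
}

def translate(context_words, detected_objects, candidates):
    def score(cand):
        cues = SENSE_CUES[cand]
        # crude "attention": fraction of detected objects matching the sense
        visual = sum(obj in cues for obj in detected_objects) / len(detected_objects)
        textual = sum(w in cues for w in context_words)
        return visual + 0.1 * textual
    return max(candidates, key=score)

# The image shows an animal in water, so the animal sense wins.
print(translate(["the", "seal", "swam"], ["animal", "water", "rock"],
                ["Robbe", "Siegel"]))  # → Robbe
```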

4) Visual content-informed evaluation metric: we propose a novel image-aware metric, VIFIDEL, for evaluating image description generation and multimodal translation systems. It estimates the faithfulness of a generated caption or translation with respect to the content of the actual image, based on the semantic similarity between explicit image information and the words in the description/translation (Madhyastha et al., 2019-toAppear);
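The core intuition can be sketched as follows: compare detected object labels with the caption's words in embedding space, and reward captions whose words cover what is in the image. The tiny two-dimensional embeddings and the average-of-best-match scoring below are illustrative simplifications of the approach, not the actual VIFIDEL formulation.

```python
import math

# Minimal sketch of an image-aware faithfulness score in the spirit of
# VIFIDEL. The toy embeddings and scoring are hypothetical simplifications.

EMB = {  # toy 2-d word embeddings (hypothetical)
    "dog": (1.0, 0.1), "puppy": (0.9, 0.2),
    "ball": (0.1, 1.0), "frisbee": (0.2, 0.9),
    "a": (0.0, 0.0), "chases": (0.5, 0.5),
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u)) or 1.0
    nv = math.sqrt(sum(a * a for a in v)) or 1.0
    return dot / (nu * nv)

def faithfulness(detected_labels, caption_words):
    """For each detected object, take its best-matching caption word and
    average these similarities (higher = caption covers the image better)."""
    return sum(
        max(cos(EMB[obj], EMB[w]) for w in caption_words)
        for obj in detected_labels
    ) / len(detected_labels)

good = faithfulness(["dog", "ball"], ["a", "puppy", "chases", "frisbee"])
bad = faithfulness(["dog", "ball"], ["a", "frisbee"])
print(good > bad)  # the caption mentioning both objects scores higher
```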

5) Phrase localization in images: this task is important for aligning words and objects when learning multimodal translation models from structured information. We proposed the first fully unsupervised method for this problem, requiring no training data or training procedure. It uses automatically detected objects, scenes and colours in images, and explores different ways to measure semantic similarity between the categories of the detected visual elements and the words in phrases (Wang and Specia, 2019-toAppear).
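Because no training is involved, the whole method reduces to a similarity lookup at test time, which a short sketch can convey: each candidate region carries a detector label, and the phrase is grounded in the region whose label is semantically closest to any of the phrase's words. The embeddings, labels and boxes below are hypothetical toy data, not the detectors or similarity measures of Wang and Specia (2019).

```python
import math

# Sketch of fully unsupervised phrase localization: no training, just
# label-to-word similarity. All data are hypothetical illustrations.

EMB = {  # toy 2-d word embeddings (hypothetical)
    "car": (1.0, 0.0), "vehicle": (0.95, 0.1),
    "tree": (0.0, 1.0), "red": (0.6, 0.6),
    "the": (0.0, 0.0),
}

def cos(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = lambda x: math.sqrt(sum(a * a for a in x)) or 1.0
    return dot / (norm(u) * norm(v))

def localize(phrase, regions):
    """regions: [(detector_label, bounding_box)]. Returns the box whose
    label best matches any word of the phrase, with no training step."""
    def match(label):
        return max(cos(EMB[label], EMB[w]) for w in phrase)
    return max(regions, key=lambda r: match(r[0]))[1]

regions = [("car", (10, 10, 60, 40)), ("tree", (70, 5, 90, 80))]
print(localize(["the", "red", "vehicle"], regions))  # → (10, 10, 60, 40)
```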