IMAGINE – Informing Multi-modal lAnguage Generation wIth world kNowledgE

Periodic Reporting for period 1 - IMAGINE (IMAGINE – Informing Multi-modal lAnguage Generation wIth world kNowledgE)

Reporting period: 2019-06-14 to 2021-03-13

Recently we have witnessed unprecedented improvements in the quality of computational models of human language, including models that use both images and text. Such models can, for instance, answer questions about the content of images, a task known as Visual Question Answering (VQA). Although these vision and language (V&L) models appear able to solve complicated tasks such as VQA or visual commonsense reasoning, we do not know the true extent of their capabilities. Moreover, V&L models have recently been found to suffer from drawbacks such as generalizing poorly to unseen or rare cases. For example, VQA models often learn to answer questions starting with "How many ..." with the number "2", since this is the most frequent answer to this type of question in their training data. When used in practice, these models' poor generalization can also lead, for instance, to various types of biased predictions or to the unfair representation of minority groups.

An important reason these models do not generalize well is that the data they are trained on is biased, and that they cannot efficiently "understand" and use the human-curated knowledge available in structured knowledge graphs. In this project, my main goal is to incorporate world knowledge into the training of state-of-the-art V&L models so that they generalize better to unseen or rare cases, and so that we can mitigate issues related to bias and unfairness. I investigate ways to make V&L models connect transparently to knowledge graphs, and whether doing so leads to less bias and better generalization. I am also devising better datasets for training vision & language models, and better benchmarks for evaluating these models, as orthogonal strategies to gauge their capabilities.

This work's societal impact is potentially very large, since state-of-the-art models that operate over text and images are applied by companies across markets and areas of knowledge, including, for instance, translation, localization, and healthcare. However, the impact is not instantaneous, since the ideas we propose (if they prove successful in a research setting) usually take some time to become widely adopted in industry. I believe that assessing what current state-of-the-art V&L models can and cannot do well is a very important first step, which I also undertake as part of the IMAGINE project.
From June 2019 until March 2021, I visited the Center for Data Science at New York University (NYU), where I collaborated with many members of the Machine Learning and Language (ML2) lab. At the beginning of my visit, in August 2019, I presented a paper co-authored with colleagues at the University of Amsterdam (UvA), in which we propose a latent variable model for multi-modal machine translation, at the Annual Meeting of the ACL (2019) in Florence, and I published an article with an error analysis of multi-modal machine translation models in the Machine Translation journal. In the second semester of 2019, I started many collaborations with members of the ML2 lab as well as with other researchers in different countries, which led to multiple publications. In July-August 2019 I hosted a former student whom I had supervised in the University of Amsterdam's Master of AI, and we worked together on a project involving the use of multi-modal structured knowledge in visual question answering. This work led to a paper published at the Asian Chapter of the ACL (AACL 2020). In January 2020, I led a large collaborative project with Prof. Sam Bowman's group, which led to another paper published at AACL 2020. That project investigated how to train models in one language (English) and apply them in a zero-shot manner to different languages and across different tasks.

Throughout 2020 I supervised three Master's students at NYU on a project involving representation learning for multi-modal knowledge graphs. Throughout 2020 I also activated my research networks and worked on different projects: I proposed using knowledge graphs to improve multilingual language models (with colleagues in Rome and Finland), which led to a paper accepted at NAACL 2021; I investigated the linguistic capabilities of pretrained vision and language models (with colleagues in Malta and Germany), which led to a paper accepted at the MMSR 2021 workshop; I worked on a review of multilingual/multi-modal natural language generation (currently under review at JAIR) and a review of efficient machine learning for language generation (in preparation), both with colleagues in the COST Action Multi3Generation; and I conducted a survey of the current landscape of NLP applications and resources for mental health and mental disorders (with colleagues in the USA and Brazil), which has resulted in one book chapter (under review) and one journal article (in preparation).

In addition to presenting my work at conferences, I recently gave or will give the following invited talks: the Dublin Machine Learning meet-up (September 2020), the KU Leuven NLP Symposium (December 2020), Cardiff University (March 2021), the Probabll lab at the University of Amsterdam (March 2021), the RGCL Machine Learning and Deep Learning Seminar Series (June 2021), and the Helsinki NLP Research Seminar in Language Technology (June 2021).
In recent work accepted at NAACL 2021, we showed that using structured encyclopedic knowledge from Wikipedia can considerably improve multilingual language models. This is an interesting first step towards seamlessly bridging large (neural) language models and multilingual structured knowledge graphs, by leveraging the data in these knowledge graphs and fine-tuning language models to predict entities from them. I plan to associate images with knowledge graphs such as Wikipedia and assess whether this visual information helps language models generalize better and/or in different ways, and to evaluate whether the trend we observed with (text-only) language models also holds for models that operate over vision & language.
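To make the idea of entity-prediction fine-tuning more concrete, the minimal sketch below (in Python, using PyTorch and the Hugging Face transformers library) adds a small entity-classification head on top of a multilingual encoder and trains it to predict which knowledge-graph entity a sentence mentions. The model name, entity vocabulary, and training example are placeholders chosen for illustration only; they are assumptions and do not reproduce the actual setup of the NAACL 2021 paper.

    import torch
    from torch import nn
    from transformers import AutoModel, AutoTokenizer

    # Placeholder multilingual encoder; the model used in the paper may differ.
    tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
    encoder = AutoModel.from_pretrained("xlm-roberta-base")

    # Hypothetical entity vocabulary drawn from a multilingual knowledge graph.
    entities = ["Q90 (Paris)", "Q64 (Berlin)", "Q60 (New York City)"]
    entity_head = nn.Linear(encoder.config.hidden_size, len(entities))

    optimizer = torch.optim.AdamW(
        list(encoder.parameters()) + list(entity_head.parameters()), lr=2e-5)

    # Toy training pair: a sentence (in any language) and the entity it mentions.
    text, entity_idx = "La tour Eiffel se trouve à Paris.", 0

    batch = tokenizer(text, return_tensors="pt")
    hidden = encoder(**batch).last_hidden_state[:, 0]  # first-token representation
    logits = entity_head(hidden)
    loss = nn.functional.cross_entropy(logits, torch.tensor([entity_idx]))
    loss.backward()
    optimizer.step()

In practice one would iterate this update over many sentence-entity pairs mined from Wikipedia in several languages, so that the same entity ties together its mentions across languages.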

In addition to better training V&L models, I am currently working on a benchmark whose goal is to assess the capabilities of V&L models in a systematic manner and from a linguistic point of view. I also plan to include energy efficiency as a dimension of the evaluation. I expect this work to have considerable impact in the vision and language research community, since it will pinpoint the main drawbacks of current state-of-the-art V&L models and provide quantitative evidence about the best ways forward. Training language models and V&L models is expensive and consumes more and more energy as models become larger and larger. One of the broader impacts I expect from this work is to be able to pinpoint which model architectures and training procedures really deliver what they promise (e.g. generalize well to unseen / out-of-distribution examples) in an energy-efficient manner. Hopefully this will lead to more targeted investigations and, ultimately, to more energy-efficient models.
We propose ways to use structured knowledge from Wikipedia to improve cross-lingual language models.