
IMAGINE – Informing Multi-modal lAnguage Generation wIth world kNowledgE

Periodic Reporting for period 2 - IMAGINE (IMAGINE – Informing Multi-modal lAnguage Generation wIth world kNowledgE)

Reporting period: 2021-03-14 to 2022-03-13

Recent years have seen unprecedented improvements in the quality of computational models of human language, including models that use both images and text. Such models can, for instance, answer questions about the content of images, a task known as Visual Question Answering (VQA). Although they appear able to solve complicated tasks such as VQA or visual commonsense reasoning (VCR), we do not know the extent of the capabilities of these vision and language (V&L) models. Moreover, V&L models have recently been found to generalise poorly to unseen or rare cases. For example, VQA models often learn to answer questions starting with "How many ..." with the number "2", since this is the most frequent answer to this type of question in their training data. In practice, this poor generalisation can also lead to biased predictions or to the unfair representation of minority groups.
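
The "How many ..." shortcut can be illustrated with a minimal sketch of a question-only baseline that never looks at the image. This is not a model from the project: the function names and the toy question-answer pairs are hypothetical and chosen purely for illustration. The baseline memorises the most frequent training answer for each question prefix, which is the behaviour a biased VQA model can fall back on.

```python
from collections import Counter, defaultdict


def question_type(question, n_words=2):
    """Use the first few lowercased words as a crude question type."""
    return " ".join(question.lower().split()[:n_words])


def fit_prior_baseline(train_pairs):
    """Map each question type to its most frequent training answer."""
    counts = defaultdict(Counter)
    for question, answer in train_pairs:
        counts[question_type(question)][answer] += 1
    return {qt: c.most_common(1)[0][0] for qt, c in counts.items()}


def predict(baseline, question, default="yes"):
    """Answer from the question prefix alone; the image is never used."""
    return baseline.get(question_type(question), default)


# Hypothetical toy training data (not taken from any real VQA dataset).
train = [
    ("How many dogs are in the picture?", "2"),
    ("How many people are sitting?", "2"),
    ("How many cars are parked?", "3"),
    ("What colour is the bus?", "red"),
]

baseline = fit_prior_baseline(train)
print(predict(baseline, "How many birds are on the wire?"))  # prints "2"
```

A model that scores well mainly by reproducing such priors will fail on images where the true count is rare in the training data, which is the kind of poor generalisation described above.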

An important reason these models do not generalise well is that the data they are trained on is biased and that they cannot efficiently "understand" and utilise the human-curated knowledge present in structured knowledge graphs. My main goal in the IMAGINE project is to incorporate world knowledge into state-of-the-art V&L models so that they generalise better to unseen or rare cases and so that issues related to bias and unfairness can be mitigated. I investigate ways to make V&L models connect transparently to knowledge graphs, and whether doing so leads to less bias and better generalisation. I am also devising better datasets to train V&L models and better benchmarks to evaluate them, as orthogonal strategies to gauge their capabilities.
From June 2019 until March 2021, I visited the Center for Data Science at New York University (NYU), where I collaborated with members of the Machine Learning and Language (ML2) lab. Early in my visit, in August 2019, I presented a paper at the Annual Meeting of the ACL (2019) in Florence, co-authored with colleagues at the University of Amsterdam (UvA), in which we propose a latent variable model for multi-modal machine translation (Calixto et al., 2019). I also published an article in the Machine Translation journal presenting an error analysis of multi-modal machine translation models (Calixto and Liu, 2019). In the second half of 2019, I started many collaborations with members of the ML2 lab and with researchers in other countries, which led to multiple publications. In July-August 2019 I hosted a former student I had supervised in the University of Amsterdam's Master of AI, and we worked together on a project involving the use of multi-modal structured knowledge in visual question answering. This work led to a paper published at the conference of the Asia-Pacific Chapter of the ACL (Milewski et al., 2020). In January 2020, I led a large collaborative project with Prof. Sam Bowman's group, which led to another paper published at the same venue (Phang et al., 2020). The project investigated how to transfer what a language model learns in one language (English) and apply it in a zero-shot manner to other languages and across different tasks.

Throughout 2020 I supervised three Master's students at NYU on a project involving representation learning for multi-modal knowledge graphs (Huang et al., 2022). During the same period I also drew on my research network and worked on several other projects: I proposed using knowledge graphs to improve multilingual language models (with colleagues in Rome and Finland), which led to a paper published at NAACL 2021 (Calixto et al., 2021); I investigated the linguistic capabilities of pretrained vision and language models (with colleagues in Malta and Germany), which led to a paper published at the MMSR 2021 workshop (Parcalabescu et al., 2021); I co-authored a review article on multilingual/multimodal natural language generation, published in the Journal of Artificial Intelligence Research (JAIR) in 2022, with colleagues in the COST Action Multi3Generation (Erdem et al., 2022); I published the VisualSem vision and language (V&L) dataset at the Multilingual Representation Learning workshop (Alberts et al., 2021), which includes text and images whose concepts are part of a knowledge graph (e.g. Wikipedia) and is designed to support V&L model training and evaluation; I conducted a survey of the current landscape of NLP applications and resources for mental health and mental disorders (with colleagues in the USA and Brazil), which was published as a book chapter (Calixto et al., 2022); and, finally, I co-authored a paper accepted at the Annual Meeting of the ACL 2022 in which we propose a benchmark to systematically assess the fine-grained linguistic capabilities and knowledge of pretrained V&L models (Parcalabescu et al., 2022).

I have published multiple papers and articles relevant to the IMAGINE project: one at ACL 2019, one in the MT Journal, two at AACL 2020, one at NAACL 2021, one at MMSR 2021, one at MRL 2021, one at EAMT 2022, and one at ACL 2022. I also have one pre-print for which a publication venue has not yet been decided. I have presented my work at meet-ups and invited talks, and co-organised the Representation Learning for NLP 2021 workshop, which is regarded as a very high-impact workshop in my area. In addition to presenting my work at conferences, I recently gave (or will give) the following invited talks: Dublin Machine Learning meet-up (September 2020), KU Leuven NLP Symposium (December 2020), Cardiff University (March 2021), Probabll lab at the University of Amsterdam (March 2021), RGCL Machine Learning and Deep Learning Seminar Series (June 2021), the Helsinki NLP Research Seminar in Language Technology (June 2021), KUIS AI at Koc University (December 2021), and the Informatics Institute NLP Lab at the Federal University of Goias (April 2022).
In recent work, my co-authors and I have shown that using structured encyclopaedic knowledge from Wikipedia can considerably improve multilingual language models (Calixto et al., 2021). This is a first step towards seamlessly bridging large (neural) language models and multilingual structured knowledge graphs: it leverages the data in these knowledge graphs by fine-tuning language models to predict entities from them. In addition to improving how V&L models are trained, I worked on an effort to better understand what pretrained V&L models really know, which resulted in two papers. The first is an in-depth investigation of how well V&L models can count and handle numeracy, published at MMSR 2021 (Parcalabescu et al., 2021). In the second, my colleagues and I proposed a benchmark for assessing the capabilities of V&L models systematically and from a linguistic point of view (Parcalabescu et al., 2022). The latter work was presented as an oral presentation at the Annual Meeting of the ACL in May 2022.

VisualSem is available for research purposes at https://github.com/iacercalixto/visualsem. Together with colleagues, I have finalised the code to train a model that learns unsupervised multi-modal knowledge base (KB) representations (Huang et al., 2022). This model uses VisualSem and is available at https://github.com/iacercalixto/visualsem-kg.
VALSE, a benchmark to assess how vision-and-language models learn specific linguistic phenomena.
We propose ways to use structured knowledge from Wikipedia to improve cross-lingual language models.