A distributional MOdel of Reference to Entities

Periodic Reporting for period 3 - AMORE (A distributional MOdel of Reference to Entities)

Reporting period: 2020-02-01 to 2021-07-31

Imagine your GPS could see. To answer the question "Do I turn there where that big tree is?", a camera is not enough; the GPS needs to connect what you say to the portion of reality that surrounds your car. We use language to talk about the world, and that includes being able to identify entities and link them with specific expressions in our output. AMORE enables computers to connect language to reality, and seeks an understanding of how people make this connection when they talk. The project thus explores the phenomenon of reference in natural language via computational modeling experiments, and we are particularly interested in the interaction of language with conceptual knowledge, on the one hand, and the extralinguistic context, on the other.

The crucial tenet of AMORE is that conceptual and referential aspects of meaning interact, and that this interaction matters both for understanding language and for modeling it computationally. By "conceptual", we mean generic aspects like the word "box" being associated with physical objects and physical objects having colors. By "referential", we mean situation-specific uses of linguistic expressions, like "red box" being used for a particular brown box containing red objects. The project has two main goals, namely to advance 1) our understanding of how conceptual and referential aspects of meaning interact, and 2) the computational modeling of language, with the hypothesis that models that have an explicit bias towards modeling referents will fare better. We use state-of-the-art Machine Learning techniques, as well as theoretical analyses of linguistic phenomena.

The main challenges we address are:

- Identifying which entities ("that big tree") are being talked about;

- Tracking the entities as they are mentioned again ("that one"), retrieving and adding new information about them as needed;

- Crucially, having the machine learn these two abilities directly from examples of how people use language.

We present the machine with different tasks that require using language to talk about the world, and it progressively learns to represent both the entities and the language that we use to refer to them. Specifically, we test our computational model in referential tasks that require matching noun phrases (such as "the big tree") with entity representations extracted from text and images.
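To make the idea of a referential task concrete, here is a minimal sketch of matching a noun phrase to one of several candidate entities by vector similarity. It is an illustration under stated assumptions, not the project's model: the feature dimensions, the entity names (`tree_1`, `tree_2`, `car_1`), and the hand-written vectors are all hypothetical, whereas a trained system would learn such representations from text and images.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def resolve(phrase_vec, entities):
    """Return the name of the entity whose representation is most similar to the phrase."""
    return max(entities, key=lambda name: cosine(phrase_vec, entities[name]))


# Hypothetical hand-made features: [is_tree, is_large, is_vehicle].
# These stand in for entity representations extracted from a scene.
entities = {
    "tree_1": [1.0, 0.9, 0.0],  # a large tree
    "tree_2": [1.0, 0.1, 0.0],  # a small tree
    "car_1": [0.0, 0.5, 1.0],   # a car
}

# Hypothetical representations of two noun phrases.
the_big_tree = [1.0, 1.0, 0.0]
the_small_tree = [1.0, 0.0, 0.0]

print(resolve(the_big_tree, entities))
print(resolve(the_small_tree, entities))
```

The generic concept "tree" is shared by both tree entities; it is the situation-specific property (size, in this toy scene) that lets the phrase pick out one referent rather than the other.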

This project is important for society because it enables a better understanding of our vehicle for thought, which is language, and it makes progress in human-computer interaction, helping technologies to better support us in our everyday life.

We have provided a better understanding of why distributional semantics, a classic approach to word meaning in Computational Linguistics, can account for conceptual but not referential aspects of meaning: it captures general properties of the word "tree", but it has no means to anchor its generic representations to the here and now, to the specific tree that is being talked about on a specific occasion.

Newer-generation neural networks hold promise of accounting for some referential aspects, as they explicitly model the context in which utterances are spoken and interpreted. We are exploring their potential, but in our experiments so far they have fallen short of accounting for many aspects of context. The main technical innovation of AMORE is the incorporation of a memory module to store information about entities. A version of this model won an international competition on learning how to match mentions in a dialogue with the corresponding characters. The dialogue was from the series Friends. For instance, given the sentence "Ross, you love this woman", the system should identify "Ross" and "you" as referring to the character ROSS, and "this woman" as referring to the character RACHEL. Our system was simpler than its competitors, and we argued that it was able to perform better because of its theoretically well-founded architecture. However, in further analysis we showed that neither our model nor another memory-based one was able to model character properties, such as their gender. We conclude that while the bias towards modeling entities is useful, current models implementing this bias are still far from accounting for entities. Similarly, our analysis of current LSTM-based language models shows their limitations in accounting for referential aspects: it suggests that they still rely heavily on lexical regularities rather than situation-specific information, and that, while they profitably use morphosyntactic features, they do not capture a more global notion of entity.
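The general idea of a memory module for entities can be sketched as follows. This is a toy illustration only, not the architecture used in the project or in the competition: the class `EntityMemory`, its methods, the two-dimensional vectors, and the update rate are all assumptions made for the example. Each entity gets one memory slot; a mention is resolved by attending over the slots, and the chosen slot can then be updated with information from the mention.

```python
import math


class EntityMemory:
    """Toy entity memory: one vector slot per known entity (hypothetical design)."""

    def __init__(self, entities):
        # Copy the initial vectors so later updates do not modify the caller's data.
        self.slots = {name: list(vec) for name, vec in entities.items()}

    def attend(self, mention_vec):
        """Softmax over dot products between the mention and every slot."""
        scores = {
            name: sum(a * b for a, b in zip(mention_vec, vec))
            for name, vec in self.slots.items()
        }
        top = max(scores.values())  # subtract the max for numerical stability
        exps = {name: math.exp(s - top) for name, s in scores.items()}
        total = sum(exps.values())
        return {name: e / total for name, e in exps.items()}

    def resolve(self, mention_vec):
        """Return the entity with the highest attention weight."""
        weights = self.attend(mention_vec)
        return max(weights, key=weights.get)

    def update(self, name, mention_vec, rate=0.5):
        """Move the entity's slot towards the mention vector (simple interpolation)."""
        slot = self.slots[name]
        self.slots[name] = [(1 - rate) * s + rate * m for s, m in zip(slot, mention_vec)]


# Hypothetical initial slots for two characters.
memory = EntityMemory({"ROSS": [1.0, 0.0], "RACHEL": [0.0, 1.0]})

# Hypothetical mention vectors; in a real system a trained network would
# compute these from the mention and its dialogue context.
mention_ross = [0.9, 0.1]        # "Ross"
mention_this_woman = [0.2, 0.8]  # "this woman"

print(memory.resolve(mention_ross))
print(memory.resolve(mention_this_woman))

# Store what was learnt from the mention in the resolved entity's slot.
memory.update("ROSS", mention_ross)
```

The design choice illustrated here is the bias towards entities: information accumulates in a slot that belongs to a referent, not to a word. Whether such slots end up encoding entity properties such as gender is a separate, empirical question, as the analysis described above shows.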

To further explore the interaction between lexico-conceptual knowledge and contextual knowledge (in this case, object properties and visual context), we are creating a visual dataset annotated with referential information. We have started by collecting object names, with 36 annotations per image in a collection of 25,000 images extracted from a previously created dataset. In this line of grounding language in visual context, we have also studied how situation-centric multimodal object representations can be learnt by grounding semantic roles in the corresponding image regions, and how multi-tasking allows us to better learn quantifiers describing specific images.

Finally, we are modeling other aspects of utterance context that affect reference. In particular, interpreting referring expressions requires an understanding not just of entities, but also of which subset of entities is actually relevant to the discourse goals (often termed 'Questions Under Discussion'/'QUDs'). Besides contributing theoretical work on this topic, we have worked on neural network models for predicting discourse goals and, to test and analyze such models, we are currently collecting a corpus of human annotations that make implicit discourse goals explicit.

We have provided a deeper understanding of computational models of language, and we have linked them to theoretical results on meaning. We expect to obtain the following results:

- a better understanding of how, and to what extent, current neural network-based models account for contextual aspects of meaning (in particular, referential aspects);

- a validation of the hypothesis that memory-augmented neural networks can better account for language as referring to entities in the real world;

- an understanding of the factors that intervene in people's choice of names for objects in visual scenes, and how that impacts computational models of naming;

- a better understanding of how the context of use influences reference, and how reference in turn feeds back to the organization of the lexicon.
Smartphone with navigation system inside a car; CC0 license