
Disagreements and Language Interpretation

Periodic Reporting for period 3 - DALI (Disagreements and Language Interpretation)

Reporting period: 2019-01-01 to 2020-06-30

Different people do not always interpret what they read or hear in the same way. Even when looking at the simplest aspects of language interpretation, such as choosing an interpretation for anaphoric expressions in context, we find that different people do not always agree on their interpretation; the percentage of expressions on which people disagree ranges from 10% to 40% depending on the genre (Poesio and Reyle, 2001, 2003; Poesio and Artstein, 2005; Versley, 2008; Recasens et al., 2011). This may not be terribly surprising: after all, our minds are all slightly different, so why should we interpret language in exactly the same way? Yet the assumption that it is always possible to identify the unique intended meaning of a language expression in context, at least in the case of grammatical and felicitous language use, underlies much if not most current research in linguistics, psycholinguistics, and especially computational linguistics (also known as Natural Language Processing, or NLP).
The objective of the DALI project is to shed some light on the issue of disagreements in language interpretation, focusing in particular on anaphora. We are collecting large amounts of data about disagreement, using games-with-a-purpose (i.e. games designed to collect data as well as to entertain) to do so on a very large scale yet very cheaply. We are analysing such data, also using novel annotation analysis methods, to identify and study the genuine cases of disagreement. And we are developing novel models of anaphoric interpretation based on machine learning to explain how people can learn to interpret language expressions without being certain of their interpretation.
We believe this work will have significant implications, first, for our understanding of how humans communicate. Second, the project addresses a fundamental limitation of current approaches to the development of NLP systems; its theoretical contributions and, even more importantly, the resources it is creating will facilitate new ways of approaching the problem, which may well result in improved performance on tasks such as anaphora resolution, where the relatively low performance of current systems is in part due to uncertainty about the interpretation of a number of expressions. Third, a number of applications of NLP technology may benefit from systems capable of identifying cases in which the interpretation of an expression is not entirely clear. For instance, writers of instruction manuals would want to know whether the text they write can be misunderstood, leading to what Willis et al. call nocuous ambiguity (Yang et al., 2010).
The scientific activities in the first half of the DALI project have by and large proceeded as planned in the proposal, although some interesting new directions of research have also emerged. Very promising research has been carried out in all directions. A lot of effort has also been devoted to hiring high-quality personnel (we are very pleased to have hired several outstanding researchers) and to dissemination, in particular in collaboration with the US Linguistic Data Consortium (LDC).
In WP1, our work on redesigning our own Game-With-A-Purpose (GWAP) Phrase Detectives (Poesio et al., 2013) led to the release of a new version of the existing game and to a new, more game-like activity for anaphora called Wormingo, but also to the notion of ‘gamifying the pipeline’ and to the implementation of games to correct the output of the stages of the NLP pipeline prior to anaphora resolution, in particular the GWAP TileAttack! for correcting markables (Madge et al., 2017b).
In WP2, we carried out the first (to our knowledge) systematic comparison of the Bayesian annotation models most widely used in NLP (Paun et al., 2018a). We also developed the first Bayesian annotation model able to aggregate anaphoric data, Mention Pair Annotation (MPA) (Paun et al., 2018b). This achievement, not expected until much later in the project, has substantially accelerated many of our activities, in particular in WP4 and WP5.
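To give a concrete, if much simplified, idea of what probabilistic aggregation of crowdsourced anaphoric judgements involves, the sketch below implements a basic Dawid-Skene-style EM procedure over binary mention-pair labels. It is not the MPA model of Paun et al. (2018b), which is a Bayesian model designed specifically for anaphoric annotation; the data layout (judgements as (pair, annotator, label) triples) and all names are hypothetical.

```python
# A much-simplified, Dawid-Skene-style EM aggregator for binary mention-pair
# judgements ("coreferent" vs "not coreferent"). Illustrative sketch only,
# NOT the MPA model of Paun et al. (2018b); the data layout and the names
# (`judgements`, `acc`, ...) are hypothetical.
from collections import defaultdict

def aggregate_mention_pairs(judgements, n_iter=20):
    """judgements: iterable of (pair_id, annotator_id, label), label in {0, 1}.
    Returns a dict mapping each pair_id to the posterior probability of a link."""
    judgements = list(judgements)
    pairs = {p for p, _, _ in judgements}
    annotators = {a for _, a, _ in judgements}

    # Initialise the link posteriors with the observed proportion of "link" votes.
    votes = defaultdict(list)
    for p, _, l in judgements:
        votes[p].append(l)
    post = {p: sum(votes[p]) / len(votes[p]) for p in pairs}
    acc = {a: 0.8 for a in annotators}  # start by assuming 80% accuracy

    for _ in range(n_iter):
        # E-step: probability of a link given the labels and annotator accuracies
        # (a uniform prior over link / no-link is assumed for brevity).
        like1 = {p: 1.0 for p in pairs}
        like0 = {p: 1.0 for p in pairs}
        for p, a, l in judgements:
            like1[p] *= acc[a] if l == 1 else 1.0 - acc[a]
            like0[p] *= acc[a] if l == 0 else 1.0 - acc[a]
        post = {p: like1[p] / (like1[p] + like0[p]) for p in pairs}

        # M-step: re-estimate each annotator's accuracy against the posteriors,
        # clamped to avoid degenerate zero-probability products.
        correct, total = defaultdict(float), defaultdict(float)
        for p, a, l in judgements:
            correct[a] += post[p] if l == 1 else 1.0 - post[p]
            total[a] += 1.0
        acc = {a: min(max(correct[a] / total[a], 0.01), 0.99) for a in annotators}

    return post
```

A model along these lines infers both a posterior probability for each candidate link and an estimate of each annotator's reliability, which is, in spirit, the kind of output that annotation models such as MPA provide for anaphoric data.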
In WP3, we released a new and much larger version of the Phrase Detectives corpus, called Phrase Detectives 2 (Poesio et al., 2019). This new dataset has been extensively used in our research, in particular in WP4 and WP5, where we carried out the first preliminary analysis of ambiguity in the Phrase Detectives corpus and developed the first anaphoric resolver able to recognize non-referring expressions.
This period also saw a lot of activity to promote our research, including the organization of two GAMES4NLP workshops, of the CRAC workshop on anaphora and coreference at NAACL 2018 and the associated shared task, and of the forthcoming ANNONLP workshop on Bayesian annotation models at EMNLP 2019. A key aspect of our strategy in this respect is the embedding of our games in LingoBoingo, a new Citizen Science platform for NLP developed in collaboration with the LDC.
The project has already delivered:
1. the first large-scale crowdsourced corpus including multiple annotations for each item, for anaphoric information or indeed for any other area of NLP (the Phrase Detectives corpus, release 2)
2. the first probabilistic annotation method for aggregating crowdsourced anaphoric information (Mention Pair Annotation)
3. the first proper games (as opposed to gamified platforms) for POS tagging (WordClicker) and noun phrase identification (TileAttack!), and the first proper game for anaphoric annotation (Wormingo)
4. a contribution to the LDC’s development of LingoBoingo, the first portal of GWAPs for NLP
5. the first coreference scorer combining the calculation of coreference metrics with the evaluation of a system’s accuracy at identifying non-referring expressions (the Extended Reference Scorer introduced for the CRAC 2018 Shared Task); a simplified scoring sketch follows this list
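As an illustration of what such a combined evaluation involves, the sketch below scores non-referring expression detection alongside a simple link-based view of coreference. It is not the released Extended Reference Scorer (which computes the standard MUC, B3 and CEAF metrics over full entities); the function names and the data layout are assumptions made for the example.

```python
# Illustrative sketch of scoring non-referring expressions alongside a simple
# link-based coreference score. Not the Extended Reference Scorer itself;
# `score_document` and its data layout are hypothetical.

def prf(gold: set, sys: set):
    """Precision / recall / F1 for a set-valued prediction task."""
    tp = len(gold & sys)
    p = tp / len(sys) if sys else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def score_document(gold_nonref, sys_nonref, gold_links, sys_links):
    """gold_nonref / sys_nonref: sets of markable ids judged non-referring
    (e.g. expletives). gold_links / sys_links: sets of coreference links, each
    a frozenset of two markable ids (a simplified link-based view; a full
    scorer would compute MUC, B3 and CEAF over complete entities)."""
    return {
        "non_referring": prf(gold_nonref, sys_nonref),
        "coreference_links": prf(gold_links, sys_links),
    }

# Toy usage: one missed non-referring markable, one spurious link.
print(score_document({"m3", "m9"}, {"m3"},
                     {frozenset({"m1", "m2"})},
                     {frozenset({"m1", "m2"}), frozenset({"m1", "m4"})}))
```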
Expected results by the end of the project include:
1. a probabilistic annotation method integrated with neural network calculation of priors to provide state-of-the-art annotations of all items in a crowdsourced anaphoric corpus (working name NNMPA)
2. subsequent releases of the Phrase Detectives corpus, including the entire 1.2M-word corpus, as more annotations are produced and NNMPA is used to infer labels for unannotated items
3. at least one release of the 10M-word DALI corpus, covering several genres and annotated using the gamified pipeline of WordClicker, TileAttack!, and Wormingo
4. turning Phrase Detectives into a proper Citizen Science platform for anaphoric annotation, allowing users to communicate with each other and including a community of experts that can carry out more complex annotation tasks than those attempted so far (e.g. bridging reference, discourse deixis)
5. versions of the games ‘polished’ in collaboration with a game designer to make them more professional
6. the development of the first anaphora resolution system trained on probabilistic annotations (and thus exploiting information about disagreements) instead of gold annotations
7. the development of the first ‘soft’ scorer for coreference, assessing system performance on the basis of probabilistic annotations rather than gold annotations (a toy illustration of this idea follows the list)
8. the development of anaphoric resolvers able to handle the more complex cases of anaphoric reference covered in the ARRAU corpus (discourse deixis, split-antecedent plurals)
9. the development of improved methods for detecting ambiguity in anaphoric reference in probabilistically annotated corpora
10. an analysis of ambiguity in anaphoric reference, distinguishing between ‘justified’ and ‘unjustified’ ambiguity
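To make item 7 above more concrete, here is a toy illustration, under entirely hypothetical names and a hypothetical data layout, of what scoring against probabilistic rather than gold annotations could look like: the system is credited with the posterior probability mass consistent with its decisions. This is not the DALI soft scorer, only a sketch of the underlying idea.

```python
# A minimal sketch of 'soft' evaluation against probabilistic annotations:
# instead of scoring against a single gold label, the system earns the
# posterior probability mass that agrees with its decision on each pair.
# Assumption-driven illustration, not the DALI soft scorer.

def soft_link_accuracy(posteriors, system_links):
    """posteriors: dict mapping a mention pair (tuple of markable ids) to the
    aggregated probability that the pair is coreferent, e.g. the output of a
    Bayesian annotation model. system_links: the set of pairs the resolver
    marks as coreferent. Returns the expected accuracy over all scored pairs."""
    credit = 0.0
    for pair, p_link in posteriors.items():
        # If the system links the pair it earns p_link; otherwise 1 - p_link.
        credit += p_link if pair in system_links else 1.0 - p_link
    return credit / len(posteriors) if posteriors else 0.0

# Example usage with toy posteriors over three mention pairs.
posteriors = {("m1", "m2"): 0.9, ("m1", "m3"): 0.2, ("m2", "m3"): 0.5}
system_links = {("m1", "m2")}
print(soft_link_accuracy(posteriors, system_links))  # (0.9 + 0.8 + 0.5) / 3
```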