Disagreements and Language Interpretation

Periodic Reporting for period 4 - DALI (Disagreements and Language Interpretation)

Reporting period: 2020-07-01 to 2022-01-31

People do not always interpret what they read or hear in the same way. They disagree even on the simplest aspects of language interpretation, such as choosing an interpretation for anaphoric expressions like the pronouns ‘it’ or ‘that’ in context. This may not sound terribly surprising: after all, we are all slightly different, so why should we interpret language in exactly the same way? Yet the assumption that it is always possible to identify the unique intended meaning of a language expression in context, at least in the case of grammatical and felicitous language use, underlies much if not most current research in linguistics, psycholinguistics, and especially computational linguistics (also known as Natural Language Processing, or NLP). The objective of DALI was to shed light on disagreements in language interpretation, focusing in particular on anaphora. We did this by collecting large amounts of data about disagreement using games-with-a-purpose (i.e. games designed to collect data as well as to entertain), which allow data collection on a large scale very cheaply. We analysed these data, including with novel annotation analysis methods, to identify and study genuine cases of disagreement. And we developed novel approaches for machines to learn language interpretation tasks, in order to explain how people may learn to interpret language expressions without being certain of their interpretation.
We believe this work will have significant implications, first of all, for our understanding of how humans communicate. Second, the project addressed a fundamental limitation of current approaches to the development of NLP systems; its theoretical contributions and, even more importantly, the resources it created will facilitate new ways of approaching the problem. These may well result in improved performance for tasks such as anaphora resolution, in which the relatively low performance of current systems is in part due to uncertainty about the interpretation of a number of expressions. Third, a number of applications of NLP technology may benefit from systems capable of identifying cases in which the interpretation of an expression is not entirely clear. For instance, writers of instruction manuals would want to know whether the text they write could be misunderstood, leading to what Willis et al. call nocuous ambiguity.

The scientific activities in the DALI project by and large proceeded as planned in the proposal, although interesting new directions of research also emerged.
In WP1 (the games work package), we completely redesigned our existing game-with-a-purpose (GWAP) Phrase Detectives, which is still collecting large amounts of data (the total amount of data collected with Phrase Detectives doubled during DALI, to over 5 million judgments). We also designed entirely new games, including a more game-like activity for anaphora called Wormingo. In addition, we aimed at ‘gamifying the pipeline,’ i.e. developing games for all aspects of language interpretation. In particular, we developed a new game for POS tagging called WordClicker, and a game called TileAttack! to identify markables. All these games were integrated in a unified platform called LingoTowns, targeted at language learners and offering the opportunity to collect data at all levels.
In WP2, data analysis, we carried out the first (to our knowledge) systematic comparison among the probabilistic annotation models most widely used in NLP (Paun et al., 2018a). We also developed the first probabilistic annotation model to aggregate anaphoric data, Mention Pair Annotation (MPA) (Paun et al., 2018b). This early achievement substantially accelerated many of our activities, in particular in WP4 and WP5. This research was reported in a new monograph, Statistical Methods for Annotation Analysis.
In WP3, dataset creation, we released a much larger version of the Phrase Detectives corpus, Phrase Detectives 2 (Poesio et al., 2019), containing over 2.5 million judgments. This new dataset was extensively used in our research, in particular in WP4 and WP5, where we carried out the first preliminary analysis of ambiguity in the Phrase Detectives corpus, and developed the first anaphoric resolver able to recognize non-referring expressions. We are currently completing a third release, Phrase Detectives 3, which will contain more than 5 million judgments about over 1.2 million words of text.
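To make the idea of aggregating crowd judgments concrete, here is a minimal sketch of the simplest possible baseline: converting each item's judgments into vote proportions and using entropy to flag items on which annotators disagree. This is not Mention Pair Annotation or any of the probabilistic models compared in the project, which also model annotator behaviour; the function names, item identifiers, and judgments below are all hypothetical.

```python
from collections import Counter
from math import log2


def label_distribution(judgments):
    """Turn the list of labels given to one item into vote proportions."""
    counts = Counter(judgments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def entropy(dist):
    """Shannon entropy in bits; 0.0 means every annotator chose the same label."""
    return sum(p * log2(1 / p) for p in dist.values() if p > 0)


# Hypothetical data: the antecedent each of four players chose for two pronouns.
judgments = {
    "it_1": ["the engine", "the engine", "the engine", "the engine"],
    "it_2": ["the engine", "the boxcar", "the engine", "the boxcar"],
}

for item, labels in judgments.items():
    dist = label_distribution(labels)
    majority = max(dist, key=dist.get)  # majority vote discards the disagreement
    print(item, majority, dist, round(entropy(dist), 2))
```

Majority voting returns a single label for both items, whereas the distribution and its entropy keep the information that the second pronoun split the annotators evenly.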
In WP4, the linguistic package, we explored the cases of anaphoric disagreement found in previous work and during the project using a combination of corpus analysis, behavioral experiments, and computational psycholinguistics. First, we developed a taxonomy of cases of anaphoric disagreement, which was used to label a sample of the cases of disagreement in the Phrase Detectives corpus. Second, we ran experiments studying the differences in interpretation for cases of anaphoric disagreement due to mereological effects (Poesio, Reyle & Stevenson, 2001) and to plurality (Versley, 2008).
In the computational modelling package, WP5, we ran two separate lines of research. The first aimed at developing computational models of anaphora resolution that could interpret the types of anaphoric reference that previous research suggested result in the most disagreement, such as plural reference and bridging reference. The second was concerned with developing machine learning approaches to learn NLP models directly from datasets containing disagreements.
Throughout the project we also invested a lot of effort in disseminating the results of our research both to the scientific community, by organizing a series of workshops in relevant areas, and to the general public, especially through social media. Our scientific dissemination included the organization of six Games and NLP workshops nurturing the community of researchers developing GWAPs for NLP, and of two CRAC workshops on anaphora and three associated shared tasks. In order to reach out to the general public we made a substantial effort with social media, starting Games and NLP channels together with the community on YouTube, Facebook, Twitter, and Instagram.

The key deliverables of DALI by the end of the project include:
1. the first large-scale crowdsourced corpus with multiple annotations per item for anaphoric information, or indeed for any other area of NLP (the Phrase Detectives corpus, releases 2 and 3);
2. the first probabilistic annotation method for aggregating crowdsourced anaphoric information, Mention Pair Annotation;
3. the first proper games for POS tagging (WordClicker), noun phrase identification (TileAttack!), and anaphoric annotation (Wormingo);
4. the first gamified platform properly integrating GWAPs for language interpretation at multiple levels, LingoTowns;
5. the first anaphoric resolver able to interpret both referring and non-referring expressions (Yu et al., 2020) and both single-antecedent and split-antecedent plurals (Yu et al., 2021);
6. the first anaphora scorer combining the calculation of coreference metrics with the evaluation of the accuracy at identifying non-referring expressions and the evaluation of bridging reference and discourse deixis (the Universal Anaphora scorer introduced for the CRAC 2018 Shared Task and further developed for CODI/CRAC 2021);
7. the first ‘soft’ scorer for NLP, assessing system performance on the basis of probabilistic annotations instead of gold annotations.
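One way to read ‘soft’ scoring, as described in the last item, is to compare a system's predicted probabilities with the distribution of annotator judgments rather than with a single gold label. The sketch below uses cross-entropy for that comparison; it is an illustration under that assumption, not the project's scorer, and the function names and numbers are hypothetical.

```python
from math import log


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy (in nats) of predicted probabilities against a soft target."""
    return -sum(
        p * log(max(predicted.get(label, 0.0), eps))
        for label, p in target.items()
        if p > 0
    )


def hard_accuracy(target, predicted):
    """1.0 if the system's top label matches the most-voted label, else 0.0."""
    return float(max(predicted, key=predicted.get) == max(target, key=target.get))


# Hypothetical annotator distribution for one pronoun, and two system outputs.
crowd = {"the engine": 0.6, "the boxcar": 0.4}
overconfident = {"the engine": 0.99, "the boxcar": 0.01}
calibrated = {"the engine": 0.6, "the boxcar": 0.4}

for name, system in [("overconfident", overconfident), ("calibrated", calibrated)]:
    print(name, hard_accuracy(crowd, system), round(cross_entropy(crowd, system), 3))
```

Both hypothetical systems pick the most-voted antecedent, so a hard metric cannot tell them apart; the soft score is lower (better) for the system whose uncertainty matches that of the annotators.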
