Periodic Reporting for period 4 - DALI (Disagreements and Language Interpretation)
Reporting period: 2020-07-01 to 2022-01-31
We believe this work will have significant implications. First, it advances our understanding of how humans communicate. Second, the project addressed a fundamental limitation of current approaches to developing NLP systems; its theoretical contributions and, even more importantly, the resources it created will facilitate new ways of approaching the problem. These may well improve performance on tasks such as anaphora resolution, where the relatively low performance of current systems is due in part to uncertainty about the interpretation of a number of expressions. Third, a number of applications of NLP technology may benefit from systems capable of identifying cases in which the interpretation of an expression is not entirely clear. For instance, writers of instruction manuals would want to know whether the text they write can be misunderstood, leading to what Willis et al. call nocuous ambiguity.
In WP1 (the games work package), we completely redesigned our existing game-with-a-purpose (GWAP), Phrase Detectives, which is still collecting large amounts of data: the total collected with Phrase Detectives doubled during DALI, to over 5 million judgments. We also designed entirely new games, including Wormingo, a more game-like activity for anaphora. We further aimed at ‘gamifying the pipeline’, i.e. developing games for all aspects of language interpretation. In particular, we developed WordClicker, a new game for POS tagging, and TileAttack!, a game for identifying markables. All these games were integrated into a unified platform called LingoTowns, which is targeted at language learners and offers the opportunity to collect data at all levels.
In WP2 (data analysis), we carried out what is, to our knowledge, the first systematic comparison of the probabilistic annotation models most widely used in NLP (Paun et al., 2018a). We also developed the first probabilistic annotation model for aggregating anaphoric data, Mention Pair Annotation (MPA) (Paun et al., 2018b). This early achievement substantially accelerated many of our activities, in particular in WP4 and WP5. This research was reported in a new monograph, Statistical Methods for Annotation Analysis.
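To illustrate what a probabilistic annotation model does, the following is a minimal sketch of a Dawid and Skene style aggregator fitted with EM. It is not MPA, and it is not claimed to be one of the models compared in the project; it only shows the general idea of estimating each annotator's reliability jointly with the most likely label for each item. The function name, the label set, and the toy judgments are all assumptions made for illustration.

```python
from collections import defaultdict


def dawid_skene(judgments, labels, iterations=20, smoothing=0.01):
    """Aggregate (item, annotator, label) triples into per-item label posteriors.

    Illustrative sketch only: a simplified Dawid and Skene style model,
    not the MPA model described in the report.
    """
    items = sorted({item for item, _, _ in judgments})
    annotators = sorted({annotator for _, annotator, _ in judgments})
    by_item = defaultdict(list)
    for item, annotator, label in judgments:
        by_item[item].append((annotator, label))

    # Initialise each item's posterior from its raw vote proportions.
    post = {}
    for item in items:
        votes = [label for _, label in by_item[item]]
        post[item] = {c: votes.count(c) / len(votes) for c in labels}

    for _ in range(iterations):
        # M-step: class prior and one confusion matrix per annotator.
        prior = {c: sum(post[i][c] for i in items) / len(items) for c in labels}
        conf = {a: {c: {l: smoothing for l in labels} for c in labels}
                for a in annotators}
        for item in items:
            for annotator, label in by_item[item]:
                for c in labels:
                    conf[annotator][c][label] += post[item][c]
        for a in annotators:
            for c in labels:
                total = sum(conf[a][c].values())
                for l in labels:
                    conf[a][c][l] /= total

        # E-step: recompute each item's posterior over the true label.
        for item in items:
            scores = {}
            for c in labels:
                p = prior[c]
                for annotator, label in by_item[item]:
                    p *= conf[annotator][c][label]
                scores[c] = p
            z = sum(scores.values())
            post[item] = {c: scores[c] / z for c in labels}
    return post


# Hypothetical toy data: annotators "a" and "b" always agree, "c" sometimes differs.
judgments = [
    ("i1", "a", "old"), ("i1", "b", "old"), ("i1", "c", "new"),
    ("i2", "a", "new"), ("i2", "b", "new"), ("i2", "c", "old"),
    ("i3", "a", "old"), ("i3", "b", "old"), ("i3", "c", "old"),
    ("i4", "a", "new"), ("i4", "b", "new"), ("i4", "c", "new"),
]
posteriors = dawid_skene(judgments, labels=["new", "old"])
for item, dist in posteriors.items():
    print(item, {label: round(p, 3) for label, p in dist.items()})
```

Unlike simple majority voting, a model of this kind returns a probability distribution for each item, so residual uncertainty about an interpretation is preserved rather than discarded.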
In WP3 (dataset creation), we released a much larger version of the Phrase Detectives corpus, Phrase Detectives 2 (Poesio et al., 2019), containing over 2.5 million judgments. This new dataset was used extensively in our research, in particular in WP4 and WP5, where we carried out the first preliminary analysis of ambiguity in the Phrase Detectives corpus and developed the first anaphoric resolver able to recognize non-referring expressions. We are currently completing a third release, Phrase Detectives 3, which will contain more than 5 million judgments about over 1.2 million words of text.
In WP4 (the linguistic package), we explored the cases of anaphoric disagreement found in previous work and during the project, using a combination of corpus analysis, behavioral experiments, and computational psycholinguistics. First, we developed a taxonomy of cases of anaphoric disagreement, which was used to label a sample of the cases of disagreement in the Phrase Detectives corpus. Second, we ran experiments studying the differences in interpretation for cases of anaphoric disagreement due to mereological effects (Poesio, Reyle & Stevenson, 2001) and to plurality (Versley, 2008).
In WP5 (the computational modelling package), we pursued two separate lines of research. The first aimed at developing computational models of anaphora resolution able to interpret the types of anaphoric reference that previous research suggested result in the most disagreement, such as plural reference and bridging reference. The second was concerned with developing machine learning approaches that learn NLP models directly from datasets containing disagreements.
Throughout the project we also invested considerable effort in disseminating the results of our research, both to the scientific community, by organizing workshop series in relevant areas, and to the general public, especially through social media. Our scientific dissemination included the organization of six Games and NLP workshops, nurturing the community of researchers developing GWAPs for NLP, and of two CRAC workshops on anaphora with three associated shared tasks. To reach the general public, we made a substantial effort on social media, starting Games and NLP channels together with the community on YouTube, Facebook, Twitter, and Instagram.
1. the first large-scale crowdsourced corpus with multiple annotations per item for anaphoric information, or indeed for any other area of NLP (the Phrase Detectives corpus, releases 2 and 3);
2. the first probabilistic annotation method for aggregating crowdsourced anaphoric information, Mention Pair Annotation;
3. the first proper games for POS tagging (WordClicker), noun phrase identification (TileAttack!), and anaphoric annotation (Wormingo);
4. the first gamified platform properly integrating GWAPs for language interpretation at multiple levels, LingoTowns;
5. the first anaphoric resolver able to interpret both referring and non-referring expressions (Yu et al., 2020) and both single-antecedent and split-antecedent plurals (Yu et al., 2021);
6. the first anaphora scorer combining the calculation of coreference metrics with the evaluation of accuracy in identifying non-referring expressions and the evaluation of bridging reference and discourse deixis (the Universal Anaphora scorer, introduced for the CRAC 2018 Shared Task and further developed for CODI/CRAC 2021);
7. the first ‘soft’ scorer for NLP, assessing system performance on the basis of probabilistic annotations instead of gold annotations.
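The idea behind soft evaluation in the last item can be sketched as follows. This is a minimal illustration that assumes cross-entropy as the soft metric; it is not the project's scorer, and the function names, labels, and vote counts are hypothetical. The system's predicted distribution for each item is compared with the distribution of annotator judgments instead of a single gold label.

```python
import math


def soft_label(votes, labels):
    """Turn the raw annotator votes for one item into a probability distribution."""
    return {c: votes.count(c) / len(votes) for c in labels}


def cross_entropy(target, predicted, eps=1e-12):
    """Cross-entropy of a predicted distribution against a soft target (lower is better)."""
    return -sum(p * math.log(max(predicted.get(c, 0.0), eps))
                for c, p in target.items() if p > 0)


def soft_score(targets, predictions):
    """Mean cross-entropy over all items."""
    return sum(cross_entropy(t, p) for t, p in zip(targets, predictions)) / len(targets)


# Hypothetical example: two items on which annotators disagree.
labels = ["referring", "non-referring"]
crowd = [
    soft_label(["referring", "referring", "referring", "non-referring"], labels),
    soft_label(["referring", "non-referring"], labels),
]

# A system that is always fully confident, and one that reflects the uncertainty.
confident = [{"referring": 0.99, "non-referring": 0.01}] * 2
hedged = [{"referring": 0.7, "non-referring": 0.3},
          {"referring": 0.5, "non-referring": 0.5}]

confident_score = soft_score(crowd, confident)
hedged_score = soft_score(crowd, hedged)
print(f"confident: {confident_score:.3f}  hedged: {hedged_score:.3f}")
```

Under a gold-label metric both systems would be scored identically, since both rank "referring" at least as high on every item; a soft metric of this kind rewards the system whose confidence matches the spread of human judgments.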