Modelling Text as a Living Object in Cross-Document Context

Project Information

InterText

Grant agreement ID: 101054961

DOI

10.3030/101054961

EC signature date 15 September 2022

Start date 1 April 2023

End date 31 March 2028

Funded under

European Research Council (ERC)

Total cost

€ 2 499 721,00

EU contribution

€ 2 499 721,00

Coordinated by

TECHNISCHE UNIVERSITAT DARMSTADT
Germany

Periodic Reporting for period 1 - InterText (Modelling Text as a Living Object in Cross-Document Context)

Reporting period: 2023-04-01 to 2025-09-30

Interpreting text in the context of other texts is very hard: it requires understanding the fine-grained semantic relationships between documents, called intertextual relationships. This is critical in many areas of human activity, including research, business, journalism, and others. However, finding and interpreting intertextual relationships and tracing information throughout heterogeneous sources remains a tedious manual task. Natural language processing (NLP) fails to adequately support it. Even in the age of large language models (LLMs), mainstream NLP considers texts as static, isolated entities, and existing approaches to cross-document understanding focus on narrow use cases and lack a common theoretical foundation. Data is scarce and difficult to create, and the field lacks a principled framework for modelling intertextuality.

InterText breaks new ground by proposing the first framework for the computational study of intertextuality in the age of LLMs. We shift the NLP paradigm from the analysis of isolated texts to the analysis of evolving digital documents, interpreted in context, and explore the next frontier in LLM research – understanding long documents in context. We advance NLP for living texts along three dimensions: linking investigates the relationships between related texts; versioning focuses on the relationships between texts and their revisions; implicit commentary studies the texts in relation to annotations made on top of them. We create foundational datasets and models for the computational analysis of these relationships, and develop robust formalisms and modular representations to incorporate cross-document context into natural language processing. To anchor our work in real-life applications, we apply our findings in two critical domains: academic peer review and conspiracy theory debunking. The ground-breaking research of InterText creates a foundational platform for intertextuality-aware NLP, crucial for managing the dynamic, interconnected digital discourse of today.

During the first two years of the project, we established a solid foundation for research in cross-document understanding. Inspired by intertextuality theory, we created the first formal framework for the general study of cross-document relations in NLP, including novel task formulations for cross-document linking, versioning and pragmatic analysis. We created foundational datasets and methods for finding connections between related documents, analysing and summarizing revision histories, and detecting the reader’s intent. From the methodological perspective, we developed novel methods for quantifying and improving structure-awareness in long-document Transformer models, created the first cross-lingual, cross-domain benchmark for modular transfer learning, and pioneered the application of large language models to cross-document modelling. On the application side, our work helped establish a new, highly active line of research on natural language processing for peer reviewing assistance, creating large open datasets, challenging tasks, data collection initiatives and competitions to drive this line of work forward. Our unique collaborative reading platform, CARE, paves the way for the principled study of inline commentary and interactive AI assistance for critical reading, peer feedback and education.

The work of InterText has substantially advanced the state of the art in cross-document modelling and beyond, and we expect our results to have a major impact on research in cross-document understanding, responsible data collection, long-document processing, efficient knowledge transfer, and AI for peer review. Our intertextual framework allows, for the first time, the general study of cross-document relations across domains and application scenarios. It will help unify previously disjoint lines of research and transfer methods and findings between them. Our cutting-edge methods for machine-assisted annotation of cross-document links allow scalable and efficient cross-document analysis in new domains, and enable new applications of intertextual analysis to media framing, conspiracy theory debunking, fact-checking, education and digital humanities. Our pioneering efforts in confidentiality- and copyright-aware data collection set new standards for responsible data handling in natural language processing, and can help design data collection policies in other domains. Our methodological contributions pave the way for new applications of large language models to classification tasks, modular transfer learning, and cross-document analysis. Our work on structure-awareness opens new opportunities for efficient and effective use of document structure to improve machine understanding of long documents. Finally, our foundational work on cross-document understanding in peer reviewing assistance enables novel applications of AI to academic quality control, and will set the research agenda in this area for years to come.

InterText: natural language processing for living texts, in context.
