Interpreting text in the context of other texts is very hard: it requires understanding the fine-grained semantic relationships between documents, called intertextual relationships. This ability is critical in many areas of human activity, including research, business, and journalism. However, finding and interpreting intertextual relationships and tracing information across heterogeneous sources remains a tedious manual task, and natural language processing (NLP) fails to adequately support it. Even in the age of large language models (LLMs), mainstream NLP treats texts as static, isolated entities, and existing approaches to cross-document understanding focus on narrow use cases and lack a common theoretical foundation. Data is scarce and difficult to create, and the field lacks a principled framework for modelling intertextuality.
InterText breaks new ground by proposing the first framework for the computational study of intertextuality in the age of LLMs. We shift the NLP paradigm from the analysis of isolated texts to the analysis of evolving digital documents, interpreted in context, and explore the next frontier in LLM research: understanding long documents in context. We advance NLP for living texts along three dimensions: linking investigates the relationships between related texts; versioning focuses on the relationships between texts and their revisions; and implicit commentary studies texts in relation to the annotations made on top of them. We create foundational datasets and models for the computational analysis of these relationships, and develop robust formalisms and modular representations to incorporate cross-document context into NLP. To anchor our work in real-life applications, we apply our findings in two critical domains: academic peer review and conspiracy theory debunking. The research of InterText creates a foundational platform for intertextuality-aware NLP, crucial for managing the dynamic, interconnected digital discourse of today.