Periodic Reporting for period 1 - ModERN (Modelling Enlightenment: Reassembling Networks of Modernity through data-driven research)
Reporting period: 2022-09-01 to 2025-02-28
ModERN’s key goal is to establish a new ‘data-driven’ literary and intellectual history of the French Enlightenment: one that is both more comprehensive and more systematic in terms of its relationship to the existing digital cultural record, and one that challenges subsequent narratives of European Modernity. Specifically, the project will employ new data-intensive computational techniques to identify and analyse conceptual and intertextual networks over an unprecedented collection of 18th-century texts. Rejecting the top-down model of much previous scholarship, which tends to focus on pre-established sets of authors or ideas, ModERN will instead let the data speak first, extracting new assemblages of authors, texts, and ideas based on both quantitative and qualitative measures of significance and influence. By following these data, i.e. tracing the myriad intertextual relationships that tie Enlightenment actors (authors, ideas, texts) to each other and to their 19th-century successors, ModERN aims to reassemble and re-evaluate the networks of influence and authority that advanced (or opposed) the Enlightenment project: networks whose structure, composition, and coverage will likely destabilise, or even overturn, accepted genealogies.
In Y2 we ran the Text-PAIR sequence aligner over the combined corpus to identify 18th-century text reuses. This process generated 2.5 million initial text alignments, which formed the basis of our initial experiments and led to multiple paper presentations and a published article in the journal Humanités Numériques. We also realised, somewhat surprisingly, that a large majority of these alignments could be classified as ‘noise’, e.g. repeated passages that occur in large quantities across texts but that cannot be considered ‘reuses’ and are therefore outside the bounds of our project. We thus devised an automated method for filtering this noise from the alignments, leveraging deep-learning algorithms and language models to train a model to recognise paratextual noise in identified alignments (see below). Once the filter was deployed, we were able to eliminate almost 90% of the identified alignments as ‘noise’, leaving us with just over 250,000 pairwise reuses drawn from our main research corpus.
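The filtering step described above can be sketched as follows. In the project itself the classifier is a fine-tuned deep-learning language model (see the notebook linked below); here a simple keyword heuristic stands in for the trained model, and all names and data are illustrative assumptions, not the project’s actual API.

```python
# Minimal sketch of the alignment-filtering pipeline: a classifier labels
# each pairwise alignment as reuse or paratextual noise, and only the
# reuses are kept. The real project uses a trained language model; the
# heuristic below is a stand-in for illustration only.

def is_noise(passage: str) -> bool:
    """Stand-in for the trained model: flags common paratextual
    boilerplate (approbations, royal privileges, tables of contents)."""
    boilerplate_markers = ("approbation", "privilege du roi", "table des matieres")
    return any(marker in passage.lower() for marker in boilerplate_markers)

def filter_alignments(alignments: list[dict]) -> list[dict]:
    """Keep only alignments whose matched passage is classified as a reuse."""
    return [a for a in alignments if not is_noise(a["passage"])]

# Illustrative data: one paratextual match, one genuine reuse.
alignments = [
    {"source": "Voltaire", "target": "Anon.",
     "passage": "Avec approbation et privilege du roi"},
    {"source": "Rousseau", "target": "Mercier",
     "passage": "l'homme est ne libre, et partout il est dans les fers"},
]
print(filter_alignments(alignments))
```

In the project’s actual workflow this binary decision is made per alignment pair at scale, which is what reduced the 2.5 million raw alignments to roughly 250,000 retained reuses.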
The team next developed a data model and structure for the above alignments, following semantic-web and open-science standards, which was implemented in PostgreSQL on the project server. A white paper was published on the project’s website that describes the data model (see below).
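The project’s actual data model is documented in the white paper on its website; purely as an illustration of the general shape such a relational model might take (texts, pairwise alignments, and an annotation status), here is a minimal sketch. Table and column names are assumptions, and SQLite stands in for the project’s PostgreSQL deployment for portability.

```python
# Illustrative sketch only: a minimal relational model for pairwise text
# reuses, with two text references per alignment plus an annotation status.
# All schema names are hypothetical; the authoritative model is described
# in the project's white paper.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE text (
    text_id   INTEGER PRIMARY KEY,
    author    TEXT NOT NULL,
    title     TEXT NOT NULL,
    pub_year  INTEGER
);
CREATE TABLE alignment (
    alignment_id    INTEGER PRIMARY KEY,
    source_text_id  INTEGER NOT NULL REFERENCES text(text_id),
    target_text_id  INTEGER NOT NULL REFERENCES text(text_id),
    source_passage  TEXT NOT NULL,
    target_passage  TEXT NOT NULL,
    status          TEXT CHECK (status IN ('valid', 'invalid', 'uncertain'))
);
""")
conn.execute("INSERT INTO text VALUES (1, 'Rousseau', 'Du contrat social', 1762)")
conn.execute("INSERT INTO text VALUES (2, 'Mercier', 'L''An 2440', 1771)")
conn.execute(
    "INSERT INTO alignment VALUES "
    "(1, 1, 2, 'l''homme est ne libre', 'ne libre, et partout', 'valid')"
)
row = conn.execute("SELECT COUNT(*) FROM alignment WHERE status = 'valid'").fetchone()
print(row[0])
```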
In Y3 we began work in two main areas: first, our new software engineer developed a web interface for annotating the filtered alignments collected in WP1 (see image 1). This interface allows team members to classify alignments as 'valid', 'invalid', or 'uncertain', as well as to compile lists of annotated alignments for further model training and analysis. The resulting annotated data will be used to train new models for alignment filtering and semantic analysis.
Work package 2 (WP2) is also well underway: we have begun analysing the sequence alignment data to model the most ‘influential’ authors and texts in the database. Moving forward, we will evaluate 18th-century neural language models trained using either context-independent or contextual word embeddings to generate conceptual maps and semantic tags for the alignments. WP3 will be repurposed to focus more intensely on the language-modelling and semantic analysis aspects, while network analysis and alignment annotation will continue in WP2.
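One elementary way to approach ‘influence’ from alignment data is sketched below: rank authors by how often later texts reuse a passage of theirs. This is an illustrative simplification under assumed field names; the project’s actual measures combine quantitative and qualitative criteria of significance and influence.

```python
# Hedged sketch: rank source authors by reuse count, i.e. the number of
# alignments in which a later text borrows one of their passages.
# Data and field names are illustrative, not drawn from the project database.
from collections import Counter

reuses = [
    {"source_author": "Voltaire", "target_author": "Anon. pamphlet"},
    {"source_author": "Voltaire", "target_author": "Mercier"},
    {"source_author": "Rousseau", "target_author": "Mercier"},
]

influence = Counter(r["source_author"] for r in reuses)
print(influence.most_common())
```

In practice such counts would be computed over the ~250,000 retained reuses and weighted or qualified before any claim about an author’s influence is made.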
The codebase for this deep-learning paratextual filter was published on our project’s GitLab:
https://gitlab.huma-num.fr/groe/modern/-/blob/a1804b986b8a1a69a4e1f12c104455e9f13c772e/BERT_textual_reuses_filter.ipynb
We have also recently developed a user interface for the annotation and analysis of our filtered alignments (see image 2). This will allow team members to eliminate invalid alignments that were not filtered in phase one, and also to train new models for alignment filtering and semantic analysis. Our new software engineer will also develop a generative-AI-based suggestion system to present users with possible positive and negative alignments, which will streamline the annotation process as we move towards other datasets such as the pamphlets and press. To our knowledge, no such suggestion system exists for text-reuse research software. The codebase is available on the project's GitLab.