Modelling Enlightenment: Reassembling Networks of Modernity through data-driven research

Periodic Reporting for period 1 - ModERN (Modelling Enlightenment: Reassembling Networks of Modernity through data-driven research)

Reporting period: 2022-09-01 to 2025-02-28

The Age of Enlightenment was a crucial moment in our shared cultural and intellectual history, not just in Europe, but the world over. At its most basic level, the Enlightenment was a set of shared philosophical ideas, mobilised by a diverse group of writers who worked collaboratively to advance the cause of freedom over tyranny and knowledge over superstition. This movement is widely believed to have begun in France, coalescing around the great mid-century Encyclopédie, before spreading outward via the transnational Republic of Letters. The ModERN project thus takes as its starting point this particular time and space as a launching pad from which to interrogate the various narratives—rationalist, materialist, universalist, progressivist—ascribed to the Enlightenment and its attendant philosophical discourse of Modernity, both of which have significantly marked the course of Western democracies over the past two centuries.

ModERN’s key goal is to establish a new ‘data-driven’ literary and intellectual history of the French Enlightenment: one that is both more comprehensive and more systematic in terms of its relationship to the existing digital cultural record, and one that challenges subsequent narratives of European Modernity. Specifically, the project will employ new data-intensive computational techniques to identify and analyse conceptual and intertextual networks over an unprecedented collection of 18th-century texts. Rejecting the top-down model of much previous scholarship, which tends to focus on pre-established sets of authors or ideas, ModERN will instead let the data speak first, extracting new assemblages of authors, texts, and ideas based on both quantitative and qualitative measures of significance and influence. By following these data, i.e. tracing the myriad intertextual relationships that tie Enlightenment actors (authors, ideas, texts) to each other and to their 19th-century successors, ModERN aims to reassemble and re-evaluate the networks of influence and authority that advanced (or opposed) the Enlightenment project: networks whose structure, composition, and coverage will likely destabilise, or even overturn, accepted genealogies.
From a technical perspective, we have worked diligently over the first 24 months to complete Work package 1 (WP1), i.e. constructing the main research corpus, converting available texts into the TEI-XML format, and loading them into PhiloLogic, an open-source full-text search and analysis engine. Thanks to institutional agreements with the University of Chicago, the University of Oxford, and the Bibliothèque nationale de France (BnF), we were able to federate existing digital collections into two main research corpora: ‘Canon’, which contains 3,500 transcribed texts with few or no errors; and ‘Archive’, which contains almost 10,000 texts produced by automated optical character recognition (OCR) and therefore highly variable in terms of noise and error rates.
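
As a rough illustration of this conversion step (a minimal sketch, not the project's actual pipeline), the example below wraps a plain-text transcription in a bare-bones TEI-XML envelope of the kind a full-text engine such as PhiloLogic can ingest; the file names, metadata fields, and very light structural markup are assumptions for demonstration only.

```python
# Minimal sketch: wrapping a plain-text transcription in a bare-bones TEI-XML
# envelope prior to loading into a full-text engine. Paths, metadata fields,
# and the degree of structural markup are illustrative only.
from pathlib import Path
from xml.sax.saxutils import escape

TEI_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt><title>{title}</title><author>{author}</author></titleStmt>
      <publicationStmt><date>{date}</date></publicationStmt>
      <sourceDesc><p>Converted from plain text.</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text><body>
{body}
  </body></text>
</TEI>
"""

def to_tei(txt_path: Path, title: str, author: str, date: str) -> str:
    """Wrap each blank-line-separated block of the transcription in a <p>."""
    raw = txt_path.read_text(encoding="utf-8")
    paragraphs = [p.strip() for p in raw.split("\n\n") if p.strip()]
    body = "\n".join(f"    <p>{escape(p)}</p>" for p in paragraphs)
    return TEI_TEMPLATE.format(title=escape(title), author=escape(author),
                               date=escape(date), body=body)

if __name__ == "__main__":
    tei = to_tei(Path("candide.txt"), "Candide", "Voltaire", "1759")
    Path("candide.xml").write_text(tei, encoding="utf-8")
```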

In Y2 we ran the Text-PAIR sequence aligner over the combined corpus to identify 18th-century text reuses. This process generated 2.5 million initial text alignments, which formed the basis of our initial experiments and led to multiple paper presentations and a published article in the journal Humanités Numériques. We also realised, somewhat surprisingly, that a large majority of these alignments could be classified as ‘noise’, e.g. repeated passages that occur in large quantities across texts but that cannot be considered a ‘reuse’, and therefore fall outside the bounds of our project. We thus set about conceiving an automated method for filtering this noise from the alignments, leveraging deep-learning algorithms and language models to train a classifier that recognises paratextual noise in identified alignments (see below). Once the filter was deployed, we were able to eliminate almost 90% of the identified alignments as ‘noise’, leaving us with just over 250,000 pairwise reuses drawn from our main research corpus.
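
By way of illustration, one crude way to surface this kind of noise before any model training is to look for passages that recur verbatim across many different documents. The sketch below assumes the alignment results have been exported as one JSON record per line; the file and field names are our assumptions rather than Text-PAIR's guaranteed output schema, and the project's actual solution is the deep-learning filter described below.

```python
# Illustrative pre-filter sketch: passages that recur verbatim across many
# different texts (printers' notices, privileges, catalogue blurbs, etc.) are
# likely paratextual noise rather than genuine reuses. Field and file names
# are assumptions for demonstration.
import json
from collections import defaultdict
from pathlib import Path

def flag_repeated_passages(alignments_path: Path, max_spread: int = 20) -> set:
    """Return normalised passages that appear in more than `max_spread` texts."""
    spread = defaultdict(set)  # passage text -> set of documents it occurs in
    with alignments_path.open(encoding="utf-8") as handle:
        for line in handle:
            record = json.loads(line)
            passage = " ".join(record["source_passage"].lower().split())
            spread[passage].add(record["source_doc_id"])
            spread[passage].add(record["target_doc_id"])
    return {p for p, docs in spread.items() if len(docs) > max_spread}

if __name__ == "__main__":
    noisy = flag_repeated_passages(Path("alignments.jsonl"))
    print(f"{len(noisy)} passages flagged as candidate paratextual noise")
```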

The team next developed a data model and structure for the above alignments, following semantic-web and open-science standards, which was implemented in PostgreSQL on the project server. A white paper was published on the project’s website that describes the data model (see below).
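
Purely for illustration, a much-simplified version of such a relational layout might look like the sketch below; the project's actual data model is the one documented in the white paper, and every table and column name here is an assumption.

```python
# Illustrative sketch only: a simplified relational layout for storing filtered
# pairwise reuses in PostgreSQL. Table names, column names, and connection
# parameters are assumptions; the authoritative data model is described in the
# project's white paper.
import psycopg2

SCHEMA = """
CREATE TABLE IF NOT EXISTS document (
    doc_id      SERIAL PRIMARY KEY,
    author      TEXT,
    title       TEXT,
    pub_year    INTEGER,
    corpus      TEXT CHECK (corpus IN ('canon', 'archive'))
);

CREATE TABLE IF NOT EXISTS alignment (
    alignment_id    SERIAL PRIMARY KEY,
    source_doc      INTEGER REFERENCES document(doc_id),
    target_doc      INTEGER REFERENCES document(doc_id),
    source_passage  TEXT,
    target_passage  TEXT,
    filter_label    TEXT,   -- e.g. 'reuse' vs. 'paratextual noise'
    annotation      TEXT    -- e.g. 'valid', 'invalid', 'uncertain'
);
"""

if __name__ == "__main__":
    # Connection parameters are placeholders.
    with psycopg2.connect(dbname="modern", user="modern", host="localhost") as conn:
        with conn.cursor() as cur:
            cur.execute(SCHEMA)
```

Keeping the automatic filter label and the human annotation side by side in one table would make it straightforward to export new training sets for later model iterations.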

In Y3 we began work in two main areas. First, our new software engineer developed a web interface for annotating the filtered alignments collected in WP1 (see image 1). This interface allows team members to classify alignments as 'valid', 'invalid', or 'uncertain', as well as to compile lists of annotated alignments for further model training and analysis. The resulting annotated data will be used to train new models for alignment filtering and semantic analysis.

Second, Work package 2 (WP2) is also well underway: we have begun analysing the sequence-alignment data to model the most ‘influential’ authors and texts in the database. Moving forward, we will evaluate 18th-century neural-network language models trained using context-independent word embeddings to generate conceptual maps and semantic tags for the alignments. WP3 will be repurposed to focus more intensely on the language-modelling and semantic-analysis aspects, while network analysis and alignment annotation will continue in WP2.
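
As a sketch of what the embedding side of this work might look like, the example below trains context-independent word embeddings on a plain-text export of the corpus and queries them for conceptual neighbours; the use of gensim's Word2Vec, the corpus path, and the query term are illustrative assumptions, not the project's chosen tooling.

```python
# Sketch of training context-independent word embeddings on the corpus and
# probing them for conceptual neighbourhoods. gensim's Word2Vec is used here
# purely as an example of a context-independent embedding model.
from pathlib import Path
from gensim.models import Word2Vec
from gensim.utils import simple_preprocess

def load_sentences(corpus_dir: Path):
    """Yield tokenised lines from every plain-text file in the corpus."""
    for txt in corpus_dir.glob("*.txt"):
        with txt.open(encoding="utf-8") as handle:
            for line in handle:
                tokens = simple_preprocess(line)
                if tokens:
                    yield tokens

if __name__ == "__main__":
    sentences = list(load_sentences(Path("corpus/canon")))
    model = Word2Vec(sentences, vector_size=200, window=5, min_count=10, workers=4)
    # Words whose usage in the training corpus most resembles that of 'raison'.
    print(model.wv.most_similar("raison", topn=15))
```
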
Currently, the majority of post-processing approaches for text-reuse data rely on pre-identified examples that can be excluded from the alignment data using a dictionary look-up; this approach is more often than not ad hoc, with significant gaps in coverage. To move beyond the state of the art, we created a binary classification algorithm aimed at filtering paratextual reuses based on word vectors extracted from the multilingual BERT deep-learning language model. Language models such as BERT provide context vectors that are effective at capturing semantic similarity, a capability that was crucial for distinguishing genuine reuses from what we define as false positives. Among the models based on the BERT architecture, we opted for the BertForSequenceClassification variant, which we refined for the text-classification task by fine-tuning the base model on annotated training data for a specific binary classification task, adjusting the weights of the pre-trained model to adapt it to the new task. Applying this filter allowed us to automatically identify and remove a huge number of paratextual reuses from our database, moving from 2.5 million alignments to just under 250,000 and thus reducing our evaluation set by an order of magnitude.
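
As a rough illustration of this workflow, the minimal sketch below fine-tunes BertForSequenceClassification on pairs of aligned passages; the library choice (Hugging Face transformers), file and column names, and hyperparameters are our assumptions here, and the project's published notebook (linked below) remains the authoritative reference.

```python
# A minimal sketch (not the project's published notebook) of fine-tuning
# multilingual BERT for binary classification of aligned passage pairs, where
# label 1 = genuine reuse and label 0 = paratextual noise. File names, column
# names, and hyperparameters are assumptions for illustration.
import pandas as pd
import torch
from torch.utils.data import Dataset
from transformers import (BertForSequenceClassification, BertTokenizerFast,
                          Trainer, TrainingArguments)

class AlignmentPairDataset(Dataset):
    """Encodes (source passage, target passage) pairs with a 0/1 label."""
    def __init__(self, frame: pd.DataFrame, tokenizer: BertTokenizerFast):
        self.encodings = tokenizer(
            list(frame["source_passage"]), list(frame["target_passage"]),
            truncation=True, padding=True, max_length=256)
        self.labels = list(frame["label"])

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: torch.tensor(val[idx]) for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

if __name__ == "__main__":
    tokenizer = BertTokenizerFast.from_pretrained("bert-base-multilingual-cased")
    model = BertForSequenceClassification.from_pretrained(
        "bert-base-multilingual-cased", num_labels=2)

    # Annotated alignments exported from the annotation interface (assumed CSV).
    train_set = AlignmentPairDataset(pd.read_csv("annotated_alignments.csv"), tokenizer)

    args = TrainingArguments(output_dir="bert_paratext_filter",
                             num_train_epochs=3,
                             per_device_train_batch_size=16)
    trainer = Trainer(model=model, args=args, train_dataset=train_set)
    trainer.train()
    trainer.save_model("bert_paratext_filter")
    tokenizer.save_pretrained("bert_paratext_filter")
```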

The codebase for this deep-learning paratextual filter was published on our project’s GitLab:
https://gitlab.huma-num.fr/groe/modern/-/blob/a1804b986b8a1a69a4e1f12c104455e9f13c772e/BERT_textual_reuses_filter.ipynb

We have also recently developed a user interface for the annotation and analysis of our filtered alignments (see image 2). This will allow team members to eliminate invalid alignments that were not filtered in phase one, and also to train new models for alignment filtering and semantic analysis. Our new software engineer will also develop a generative-AI-based suggestion system to present users with possible positive and negative alignments, which will streamline the annotation process as we move towards other datasets such as the pamphlets and press. To our knowledge, no such suggestion system exists for text-reuse research software. The codebase is available on the project's GitLab.
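
As a purely hypothetical sketch of how such a suggestion step might surface candidates, the example below scores unannotated alignment pairs with a fine-tuned classifier and ranks the most confident positive and negative cases for human review; the planned system is generative-AI based, so this is only one possible design, and all file, model, and column names follow the assumptions of the previous sketch.

```python
# Hypothetical sketch: rank unannotated alignment pairs by classifier
# confidence so that annotators see the most clearly valid and most clearly
# invalid candidates first. Not the project's implementation.
import pandas as pd
import torch
from transformers import BertForSequenceClassification, BertTokenizerFast

def score_alignments(frame: pd.DataFrame, model_dir: str) -> pd.DataFrame:
    """Add a p_valid column: model probability that the pair is a genuine reuse."""
    tokenizer = BertTokenizerFast.from_pretrained(model_dir)
    model = BertForSequenceClassification.from_pretrained(model_dir)
    model.eval()
    inputs = tokenizer(list(frame["source_passage"]), list(frame["target_passage"]),
                       truncation=True, padding=True, max_length=256,
                       return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1)
    scored = frame.copy()
    scored["p_valid"] = probs[:, 1].tolist()  # label 1 = genuine reuse (assumed)
    return scored

if __name__ == "__main__":
    scored = score_alignments(pd.read_csv("unannotated_alignments.csv"),
                              "bert_paratext_filter")
    print("Likely valid:\n", scored.nlargest(5, "p_valid"))
    print("Likely invalid:\n", scored.nsmallest(5, "p_valid"))
```
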
Interface for evaluating text reuses