Migrations of Textual and Scribal Traditions via Large-Scale Computational Analysis of Medieval Manuscripts in Hebrew Script

Project information

MIDRASH

Grant agreement ID: 101071829

DOI

10.3030/101071829

EC signature date 18 April 2023

Start date 1 October 2023

End date 30 September 2029

Funded under

European Research Council (ERC)

Total cost

€ 10 296 259,00

EU contribution

€ 10 296 259,00

Coordinated by

ECOLE PRATIQUE DES HAUTES ETUDES
France

Periodic Reporting for period 1 - MIDRASH (Migrations of Textual and Scribal Traditions via Large-Scale Computational Analysis of Medieval Manuscripts in Hebrew Script)

Reporting period: 2023-10-01 to 2025-03-31

Most of what we know of the human past has been learned from ancient handwritten sources. Most of these remain untranscribed, let alone studied, but recent AI developments make their analysis feasible. Medieval Jews from China to England produced a multitude of manuscripts in Hebrew characters, written in Hebrew, Aramaic, Judeo-Arabic and other vernacular languages. The surviving ca. 100,000 manuscripts and 350,000 fragments constitute the unique remains of medieval Jewish literary culture, bearing witness to centuries of learning, but also to migration, persecution and censorship. Our project is set to enable freely available full-text search of the entire corpus of medieval Hebrew script manuscripts, including many previously unexamined sources. Its unique starting point is the National Library of Israel’s KTIV project, which has assembled digital images of almost all Hebrew script manuscripts from public and private libraries all over the world. The project has the potential to revolutionize Jewish Studies and research on manuscript cultures in general. The manual and computational analysis of this enormous corpus of millions of pages will reveal and map previously unsuspected networks of transmission and the migration of textual and scribal features. For the very first time, scholars and laypersons will be able to conduct comprehensive queries across the entire corpus, combining intelligent full-text search with rich metadata filters. Known and unknown works will be identified and reconstructed, and multilingual intertextualities discerned. The best fine-grained manual paleographical techniques and AI will be brought to bear on questions of provenance and transmission.
Philological methods and computational linguistics will be applied to questions of textual fluidity and evolution in order to further our knowledge of the production and transmission of manuscripts and texts, of their authors, scribes and readers, and to enhance their role as a pivotal aspect of European and Mediterranean intellectual history.

For the transcription, EPHE1 has defined conventions for encoding layout segmentation, reading order and text recognition for Genizah fragments, manuscripts and books. We have gathered existing training ground truth, created more, and worked on homogenizing it to these conventions. We have homogenized catalogue data and, among other things, created a list of publishers. We have trained preliminary segmentation and recognition models for print and manuscripts. We are working on improving our current architectures with transformers that fuse visual, linguistic and metadata dimensions. We have created a pipeline that uses a layout classifier developed by the TAU team to select the layout segmentation and text recognition models best suited to each object, and we have applied it to the most complex data, the Genizah fragments. The publication of the first complete automatic transcription of 1 million images is in preparation. We have also created TABA, a pipeline for self-supervised ground truth creation. Transcriptions of books and manuscripts are underway.
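The model-selection step described above can be illustrated with a minimal sketch. It is not the project's implementation: the layout labels, the model names, the `ModelPair` type and the functions `select_models` and `plan_batch` are all hypothetical, and the layout classifier is represented by a plain callable standing in for a trained model.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass(frozen=True)
class ModelPair:
    """A segmentation model and a recognition model meant to be used together."""
    segmenter: str
    recognizer: str


# Hypothetical registry: layout label -> models assumed to suit that layout.
REGISTRY: Dict[str, ModelPair] = {
    "single_column": ModelPair("seg_single_column", "rec_manuscript"),
    "multi_column": ModelPair("seg_multi_column", "rec_manuscript"),
    "printed_page": ModelPair("seg_print", "rec_print"),
}

# Used when the classifier returns a label that the registry does not know.
FALLBACK = ModelPair("seg_generic", "rec_generic")


def select_models(image_id: str, classify_layout: Callable[[str], str]) -> ModelPair:
    """Classify the layout of one image and return the matching model pair."""
    label = classify_layout(image_id)
    return REGISTRY.get(label, FALLBACK)


def plan_batch(
    image_ids: Iterable[str], classify_layout: Callable[[str], str]
) -> Dict[ModelPair, List[str]]:
    """Group images by selected model pair so each pair is loaded only once."""
    plan: Dict[ModelPair, List[str]] = {}
    for image_id in image_ids:
        pair = select_models(image_id, classify_layout)
        plan.setdefault(pair, []).append(image_id)
    return plan


if __name__ == "__main__":
    # Toy stand-in for a layout classifier: a lookup table of precomputed labels.
    toy_labels = {"frag_001": "single_column", "frag_002": "printed_page"}
    for pair, ids in plan_batch(toy_labels, toy_labels.get).items():
        print(pair.segmenter, pair.recognizer, ids)
```

Grouping images by model pair before processing is one possible design choice for large batches, since it avoids reloading models for every image.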
The EPHE2 team has been improving the database platform “HebrewPal: Digital Album of Hebrew Palaeography” and feeding it with fine-grained manual palaeographical descriptors. So far, 1,000 manuscript samples have been included in the database. Letters and ornaments have been manually annotated to learn pertinent features revealing time, place, or scribe. In addition to the palaeographical study, EPHE2 has worked on developing an additional tool for HebrewPal: colour coding of the clauses of annotated legal documents for diplomatic analysis. A collective volume on the Hebrew Palaeography Method is being written.
The BIU team is in charge of the natural language processing (NLP) component of the project. It has worked on: (1) An automatic post-OCR correction algorithm for medieval Hebrew texts. (2) MsBERT, a new BERT model, the first of its kind, dedicated specifically to corpora of medieval Hebrew manuscript transcriptions. The aforementioned post-OCR correction algorithm is based in part on this BERT model. (3) A BERT model for Judeo-Arabic. The current post-OCR correction algorithm works well for Hebrew but not for Judeo-Arabic. To address this lacuna, we are currently producing a dedicated Judeo-Arabic BERT model. So far we have worked on the curation and preprocessing of the data, and we will soon begin training the model. (4) Linguistic parsing infrastructure: Many of the medieval manuscripts on which we focus contain older layers of Hebrew text (Jewish legal texts, philosophical Hebrew treatises, hermeneutical texts, etc.). However, the basic infrastructure for automatic linguistic analysis of historical Hebrew text is sorely lacking: not only is there no syntactic parser for such texts, there is not even a model to chunk them into manageable units. We are currently developing models to fill this lacuna, creating an automatic chunking algorithm and a syntactic parser for Historical Hebrew. Once completed, these models will be integrated into the automatic post-OCR correction algorithm to allow more intelligent and linguistically informed decisions on OCR correction.
The TAU team has been working on deep-learning solutions for processing images of medieval Hebrew manuscripts, including: segmenting text regions and identifying text lines; layout recognition; automatic prediction of paleographic features; and automatic clustering and sub-clustering of geographical types of script. We have published our layout classifier, NetLay. We are currently advancing towards automatic dating of medieval Hebrew manuscripts, building the largest dataset to date for deep learning on this task, comprising 7,800 images.
The NLI has purchased and installed the “Bet Eqed” GPU cluster, which hosts MiDRASH’s transcription and enrichment software and serves as the computational backbone of the project. We exported 550K catalogue records, ca. 20M images of digital manuscripts and books, and 3K transcriptions from our collections of Hebrew manuscripts, printed Hebrew books, and Genizah fragments, ca. 600 TB in total. We are improving our catalogue to enable the future integration of enriched outputs generated by the project.

Our methodologies and expertise have attracted considerable attention and led to further collaborations with other teams. The ERC DeLiCaTe project uses eScriptorium for Medieval Georgian and Armenian. In a collaboration with Beth Mardutho, the Institute of Advanced Studies, Princeton, and Princeton University, as well as the Bibliothèque Nationale de France and the Biblioteca Apostolica Vaticana, our Paris team is working on a complete automatic transcription of all their digitized Syriac manuscripts. With the ERC Kitab and the openITI teams, we are working on improving automatic transcriptions of Arabic manuscripts. With the ALMAnaCH team of INRIA, we are working on the mass application of automatic transcription to Latin and French manuscripts and books.
HebrewPal serves as the model for a new project developed in collaboration with MiDRASH: the Palaeographical Album of Arabic Script (ArabicPal).
