Periodic Reporting for period 1 - MIDRASH (Migrations of Textual and Scribal Traditions via Large-Scale Computational Analysis of Medieval Manuscripts in Hebrew Script)
Reporting period: 2023-10-01 to 2025-03-31
The EPHE2 team has been improving and populating the database platform “HebrewPal: Digital Album of Hebrew Palaeography” with fine-grained manual palaeographical descriptors. So far, 1000 manuscript samples have been included in the database. Letters and ornaments have been manually annotated in order to learn the pertinent features that reveal time, place, or scribe. In addition to the palaeographical study, EPHE2 has been developing an additional HebrewPal tool: colour coding of the clauses of annotated legal documents for diplomatic analysis. A collective volume on the Hebrew Palaeography Method is being written.
The BIU team is in charge of the natural language processing (NLP) component of the project. It has worked on: (1) An automatic post-OCR correction algorithm for medieval Hebrew texts. (2) MsBERT, a new BERT model, the first of its kind, dedicated specifically to corpora of medieval Hebrew manuscript transcriptions. The aforementioned post-OCR correction algorithm is based in part on this BERT model. (3) A BERT model for Judaeo-Arabic. The current post-OCR correction algorithm works well for Hebrew but not for Judaeo-Arabic. To address this lacuna, we are currently producing a dedicated Judaeo-Arabic BERT model. So far we have worked on the curation and preprocessing of the data, and we will soon begin training the model. (4) Linguistic parsing infrastructure: Many of the medieval manuscripts on which we focus contain older layers of Hebrew text (Jewish legal texts, philosophical Hebrew treatises, hermeneutical texts, etc.). However, the basic infrastructure for automatic linguistic analysis of historical Hebrew text is sorely lacking: not only is there no syntactic parser for such texts, there is not even a model to chunk them into manageable units. We are currently developing models to fill this lacuna, creating an automatic chunking algorithm and a syntactic parser for Historical Hebrew. Once completed, these models will be integrated into the automatic post-OCR correction algorithm to allow more intelligent and linguistically informed decisions on OCR correction.
The TAU team has been working on deep-learning solutions for processing images of medieval Hebrew manuscripts, including: segmenting text regions and identifying text lines; layout recognition; automatic prediction of palaeographic features; and automatic clustering and sub-clustering of geographical types of script. We have published our layout classifier, NetLay. We are currently advancing towards automatic dating of medieval Hebrew manuscripts, building the largest dataset to date for this purpose: 7800 images for deep learning.
The NLI has purchased and installed the “Bet Eqed” GPU cluster, which hosts MiDRASH’s transcription and enrichment software and serves as the computational backbone of the project. We exported 550K catalogue records, ca. 20M images of digital manuscripts and books, and 3K transcriptions from our collections of Hebrew manuscripts, printed Hebrew books, and Geniza fragments, a total of ca. 600 TB. We are improving our catalogue to enable the future integration of enriched outputs generated by the project.
HebrewPal is used as a model for a new project developed in collaboration with MiDRASH: the Palaeographical Album of Arabic Script (ArabicPal).