Periodic Reporting for period 1 - Intellexus (Geology of Texts, Genealogy of Concepts, Intellectual Ecosystems: Mapping the Indic and Tibetic Buddhist Text Corpora)
Reporting period: 2024-04-01 to 2025-09-30
Intellexus involves two main research groups. The humanistic group, based at the University of Hamburg (UHH), consists of philologists working on textual material from one or more of the three corpora under investigation, covering a range of genres and topics that serve as case studies for the development of the computational tools. It comprises three sub-teams, each focusing on material from one of the three corpora. The computer-scientific group, based at Reichman University (RUNI), comprises two sub-teams: a research team focusing on language modelling and algorithm development, and a technology and development team focusing on empirical work, software development, and support and management of the compute environment.
Intellexus has two main parts, a micro-level and a macro-level, each covering three years. The micro-level focuses primarily on individual texts, concepts, and elements of the intellectual ecosystems, and aims to develop monolingual computational tools and methods for Sanskrit and Tibetan. This level also involves historical-philological groundwork for the texts selected as case studies, and investigations of the overarching topics “Patterns of Intertextuality” and “Intersecting Pathways of Ideas.” At the macro-level, the focus widens to methods that can be applied cross-lingually, across the corpora, and across the entire intellectual ecosystems. These include clustering, stratification, cross-lingual matching of texts and concepts, and visualisations. On this level, we will apply the developed tools and methods to the case studies and investigate the overarching topic “Profiling Intellectual Ecosystems.”
The UHH and RUNI teams have collaborated on various topics. They first created a benchmark (DharmaBench) for evaluating language models on 13 classification and detection tasks for Sanskrit and Tibetan; the resulting article was accepted for publication and will be presented at the IJCNLP-AACL 2025 conference. Subsequently, four teams worked on four computational tasks: (a) Sanskrit verse detection; (b) Tibetan text matching; (c) segmentation of Tibetan texts into allochthonous and autochthonous segments; (d) metaphor detection in Sanskrit and Tibetan as preparatory work for concept detection and concept matching. An article on allochthonous-autochthonous segmentation was submitted to LREC: T05 Digital Humanities, Cultural Heritage and Computational Social Science. Two more articles are currently in preparation. In addition, team members from UHH and RUNI worked jointly on the conception and construction of the Intellexus Platform, particularly the Intellectual Ecosystem Database.
The RUNI team carried out computational work in several areas. (1) As part of the computational groundwork for the concept-matching component, a research line exploring new computational methods for detecting non-compositional expressions in running text has been initiated. To establish a solid foundation within the broader computational linguistics field, the work focuses on several general-purpose languages. This effort resulted in two accepted publications, to be presented at IJCNLP-AACL 2025 in Mumbai, India, and at EMNLP 2025 in China. (2) Research on language model adaptation for low-resource languages was initiated and then extended to Tibetan and Sanskrit. (3) Innovations in parameter-efficient fine-tuning (PEFT) have been developed, aimed at addressing the scarcity of annotated data while mitigating catastrophic forgetting during model adaptation. (4) Methods for vocabulary extension in generic LLMs to support Sanskrit and other underrepresented languages are currently being explored. (5) Script type detection and bidirectional conversion were developed for Tibetan (e.g. Wylie) and Sanskrit (e.g. IAST). (6) A significant R&D investment has been made in the development of the Intellexus cloud-based platform, including high-level platform design, software architecture, DevOps and deployment, metadata database schema, and metadata migration.
During this period, Intellexus organised several events: the first Intellexus workshop, the first Intellexus hackathon, an Intellexus panel at the XXth IABS, and two collaborative ARCAS-Intellexus workshops.
Intellexus will develop a platform comprising texts in Sanskrit and Tibetan, their associated metadata, and sophisticated NLP tools and visualisation options, which will open new ways of understanding the development and evolution of Indic and Tibetic Buddhist intellectual cultures. The Intellexus platform will allow the identification of direct and indirect borrowing of texts and ideas on a large scale, both mono- and cross-lingually. It will also provide the means to trace the pathways along which concepts travelled and to identify their intersections.
The platform will enable a large-scale mapping of the Indic and Tibetic texts, the ideas contained therein, and the intellectual ecosystems behind them. At the same time, it will facilitate individual research projects focusing on selected texts by offering answers to complex questions concerning textual history, the history of ideas, and the history of textual culture.