CORDIS - EU research results

Linking Latin. Building a Knowledge Base of Linguistic Resources for Latin

Periodic Reporting for period 4 - LiLa (Linking Latin. Building a Knowledge Base of Linguistic Resources for Latin)

Reporting period: 2022-12-01 to 2023-05-31

The wide diffusion of information technology has led to substantial growth in the quantity, diversity and complexity of web-accessible linguistic data. Although a large variety of digital linguistic resources (like textual corpora, lexica and dictionaries) is now available for several languages, in many cases these still cannot interoperate. This is a limitation, because linguistic resources become even more useful when linked to one another, which makes it possible to exploit the contribution each of them gives to linguistic analysis (at the lexical, morphological, syntactic, semantic, pragmatic level, etc.). The challenge we must meet today is that of interlinking and making the fullest use of the tremendous wealth of linguistic data accumulated over more than half a century of computational linguistics and empirical study of language.
The objective of the LiLa project was to interlink the wealth of linguistic resources for Latin assembled so far and, ultimately, to make them interact, thus allowing users to access and exploit them to the fullest.
To address this challenge, LiLa incorporated linguistic resources for Latin into the Linked Data framework, thus facilitating their publication, interlinking and interaction. To this end, the project built a knowledge base of interoperable resources for Latin by using the Linked Data principles to combine data from disparate sources.
The availability of interconnected data sets for Latin made possible by the LiLa knowledge base allows scholars to make sense of the wealth of resources built so far and to understand how useful these can be in their daily work. Thanks to the successful outcome of the project, the LiLa knowledge base is today the main venue for the publication of linguistic resources for Latin, and its architecture is a reference model for similar initiatives on other (ancient and modern) languages.
The architecture of the LiLa knowledge base is strongly lexically based. It rests on a simple but effective assumption that strikes a good balance between feasibility and granularity: textual resources are made of (occurrences of) words, lexical resources describe properties of words, and Natural Language Processing (NLP) tools process words (see Image 1). In particular, the level of lemmas ("canonical forms") is considered the ideal interface between lexical resources, annotated corpora and NLP tools that lemmatize their input text. For this reason, the project identified and created a large list of Latin lemmas (called the "Lemma Bank") as the core of LiLa (see: https://lila-erc.eu/lodview/data/id/lemma/LemmaBank). Interoperability is achieved by linking all the entries in lexical resources and all the corpus tokens that refer to the same lemma.
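The lemma-as-interface idea can be illustrated with a minimal in-memory sketch. It is not LiLa's actual implementation (the knowledge base is published as Linked Data): the class names, lemma identifiers and resource names below are hypothetical and serve only to show how corpus tokens and lexical entries become connected once both point to the same lemma.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One item of a resource attached to a lemma (all values are illustrative)."""
    resource: str  # name of the corpus or lexicon
    item: str      # token position or entry id inside that resource
    kind: str      # "token" for corpora, "entry" for lexical resources


class LemmaHub:
    """Toy stand-in for a lemma collection used as the hub between resources."""

    def __init__(self):
        self._links = defaultdict(list)

    def link(self, lemma_id, resource, item, kind):
        # Attach a corpus token or a lexical entry to a lemma identifier.
        self._links[lemma_id].append(Link(resource, item, kind))

    def linked(self, lemma_id, kind=None):
        # Return everything attached to a lemma, optionally filtered by kind.
        links = self._links.get(lemma_id, [])
        return [l for l in links if kind is None or l.kind == kind]


# Hypothetical identifiers, not real LiLa URIs.
hub = LemmaHub()
hub.link("lemma:amo", "corpus-A", "sent1-tok3", "token")
hub.link("lemma:amo", "lexicon-B", "entry-17", "entry")
hub.link("lemma:rosa", "corpus-A", "sent1-tok4", "token")

# From a token in corpus-A, reach the lexical entries that share its lemma.
entries_for_amo = hub.linked("lemma:amo", kind="entry")
```

In this sketch neither resource knows about the other; they become interoperable only because each one links to the same lemma identifier.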
The Lemma Bank is a reference collection of citation forms for Latin lexemes on the web, each assigned a unique and persistent identifier. The more than 200K ontolex Lexical Forms that make up the Lemma Bank cover the lexicon of Classical and Medieval Latin and a large Onomasticon (source data are taken from the LEMLAT morphological analyzer). The Lemma Bank is continuously updated and extended to address the vocabulary of the specific resources linked to LiLa.
Using the Lemma Bank as the connecting element, the LiLa project interlinked several lexical and textual resources representing different kinds of (meta)linguistic information, and published them as Linked Data by means of various ontologies and models. The full list of resources for Latin interlinked through LiLa is available at https://lila-erc.eu/data-page/.
The project built a number of online services to query and populate the LiLa knowledge base:
(1) TextLinker, a tool that automatically tokenizes, lemmatizes, PoS-tags and links to LiLa the tokens of an input raw text in Latin (https://lila-erc.eu/LiLaTextLinker/);
(2) LISP, a graphical platform to run queries on the resources interlinked in the Knowledge Base (http://lila-erc.eu:8080/lila-lisp/);
(3) an interface for querying the Lemma Bank (https://lila-erc.eu/query/);
(4) a SPARQL access point with a number of ready-made queries on the resources made interoperable through LiLa (https://lila-erc.eu/sparql/).
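As an illustration of how the SPARQL access point in (4) might be used programmatically, the sketch below builds a query that looks up lemmas by written form and encodes it as a standard SPARQL Protocol GET request. It rests on assumptions not confirmed by the text: that the URL given above accepts such requests directly, and that written forms in the Lemma Bank are exposed through the OntoLex property ontolex:writtenRep. The function names are hypothetical, and no request is actually sent.

```python
from urllib.parse import urlencode

# Access point named in the text; that it accepts protocol requests
# at exactly this URL is an assumption.
ENDPOINT = "https://lila-erc.eu/sparql/"

# Namespace of the W3C OntoLex-Lemon core vocabulary.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"


def lemma_query(written_rep: str, limit: int = 10) -> str:
    """Build a SPARQL query for lemmas with the given written form.

    Assumes (not confirmed by the source) that lemmas carry
    ontolex:writtenRep literals.
    """
    # Escape backslashes and double quotes so the literal stays well-formed.
    escaped = written_rep.replace("\\", "\\\\").replace('"', '\\"')
    return (
        f"PREFIX ontolex: <{ONTOLEX}>\n"
        "SELECT ?lemma WHERE {\n"
        "  ?lemma ontolex:writtenRep ?rep .\n"
        f'  FILTER(STR(?rep) = "{escaped}")\n'
        "}\n"
        f"LIMIT {int(limit)}"
    )


def request_url(query: str) -> str:
    """Encode the query as a SPARQL Protocol GET request URL."""
    return ENDPOINT + "?" + urlencode({"query": query})


query = lemma_query("rosa")
url = request_url(query)
```

The resulting URL could then be fetched with any HTTP client; the ready-made queries offered at the access point itself remain the authoritative starting point.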
Before LiLa, no such fine-grained interlinking among linguistic resources had ever been performed in the Linguistic Linked Open Data context, or beyond it. LiLa interlinks textual resources at the token level and lexical resources at the entry level. This makes it possible to run federated queries on the resources made interoperable by their linking to the LiLa knowledge base. Methodologically, this not only eases the daily work of scholars, who can now use at the same time information provided by different (but no longer scattered) resources; most importantly, it supports the scholarly interpretation of data by organizing them in a way never before available, and it makes the overall process replicable.
Moreover, the LiLa knowledge base is a dynamic and open-ended venue where new resources can be interlinked. In this respect, focusing on Latin is a winning choice: for centuries Latin was the lingua franca of the European area, so it is provided with several bilingual resources, like dictionaries and translations, which can now be made interoperable through LiLa. Methodologically, LiLa changed the way linguistic resources for Latin (and beyond) can be published online: they are no longer separate silos but interact thanks to the architecture of the knowledge base, by "speaking the same language", i.e. by using vocabularies of knowledge representation widely shared in the Linked Data world.
In the context of the Linguistic Linked Open Data (LLOD) community, LiLa was a successful use case, in which new ontologies were built and available ontologies were evaluated by applying them to real data. Methodologically, LiLa was in this respect an innovative proof of concept, in which vocabularies developed by the LOD community working on linguistic (meta)data were empirically tested (and thus extended and refined) and a large set of resources was finally interlinked. Moreover, the results of LiLa had an impact on the LLOD world as a whole, showing that a fine-grained level of interoperability between resources is possible through a language-independent and very simple architecture.
Image 1: The fundamental architecture of LiLa