
Linking Latin. Building a Knowledge Base of Linguistic Resources for Latin

Periodic Reporting for period 2 - LiLa (Linking Latin. Building a Knowledge Base of Linguistic Resources for Latin)

Reporting period: 2019-12-01 to 2021-05-31

The wide diffusion of information technology has led to a substantial growth in the quantity, diversity and complexity of web-accessible linguistic data. Although a large variety of digital linguistic resources (such as textual corpora, lexica and dictionaries) is now available for several languages, in many cases these still cannot interoperate. This is a limitation, because linguistic resources become even more useful when linked to one another, which makes it possible to exploit the contribution each of them gives to linguistic analysis (at the lexical, morphological, syntactic, semantic and pragmatic levels, among others). The challenge we must meet today is that of interlinking and making the fullest use of the tremendous wealth of linguistic data accumulated over more than half a century of computational linguistics and empirical study of language.
The objective of the LiLa project is to interlink the wealth of linguistic resources and NLP tools for Latin assembled so far and, ultimately, to make them interact. This bridges the gap between raw language data, NLP and knowledge descriptions, allowing users to access and exploit the resources and tools available today to the fullest.
To address this challenge, LiLa intends to incorporate linguistic resources for Latin into the Linked Data framework, thus facilitating their publication, interlinking and interaction. To this end, the project builds a knowledge base for Latin by using the Linked Data principles to combine data from disparate linguistic resources, provide NLP web-services and ultimately add Latin to the multilingual Linguistic Linked Open Data cloud (https://linguistic-lod.org).
The results of the project promise to heavily impact both the world of Linguistic Linked Data and that of the Humanities. In particular:
- the LiLa knowledge base for Latin will represent a reference model for similar initiatives on other ancient languages and will become the main venue for the publication of new linguistic resources and, more generally, digital objects pertaining to Latin cultural heritage;
- the availability of interconnected data sets for Latin will allow Classicists to make sense of the wealth of resources and tools built thus far and to understand how useful these can be in their daily work. This will contribute to shaping new generations of Classicists, who will be able, for instance, to use both a print scholarly edition and a Linked Data knowledge base for Latin, thus making the current distinction between traditional and Digital Humanities meaningless.
The first half of the project (30 months) was devoted to the following two main tasks.

(1) Selecting, assessing and improving linguistic resources for Latin
The project selected more than 20 linguistic resources (corpora, lexica, dictionaries) eligible to be interlinked in LiLa.
In selecting and collecting the available resources for Latin, the LiLa team became aware of the need to assess and improve two that are considered to be essential for exploiting the textual/lexical data to the fullest, that is, the Latin WordNet and the valency lexicon Vallex. The Latin WordNet was evaluated by correcting the automatic assignment of synsets to lexical items: a set of 1,000 manually checked lexical entries was produced. As for the Vallex lexicon, evaluated lexical entries from the Latin WordNet were assigned one (or more) valency frames for each of their synsets.
Given the central role played by lemmas in the LiLa knowledge base (see next point), the project devoted considerable time and effort to developing a wide and carefully curated list of Latin lemmas ('lemmaBank'), to be used as the connecting elements between resources in the knowledge base. The list includes more than 130,000 lemmas extracted from reference dictionaries, covering Classical, Late and Medieval Latin. Each lemma is assigned a number of attributes, such as its Part of Speech, gender and inflectional category. The list also includes spelling variants.

(2) Designing a reference model for Latin linguistic resources
The project designed the fundamental reference model of the LiLa knowledge base, which specifies metadata, terminology, annotation models and relations. The reference model of LiLa is a highly lexically-based architecture, grounded on a simple but effective assumption that strikes a good balance between feasibility and granularity: textual resources are made of (occurrences of) words, lexical resources describe properties of words, and NLP tools process words (see Image 1). In particular, the level of the lemma ("canonical form") is considered the ideal interface between lexical resources, annotated corpora and NLP tools that lemmatize their input text. For this reason, the project identified a large list of Latin lemmas as the core of LiLa. Interoperability is achieved by linking every entry in a lexical resource and every corpus token to the lemma it refers to, so that all items referring to the same lemma are connected to one another.
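The lemma-as-interface idea described above can be illustrated with a minimal sketch in Python. It is not the LiLa implementation (which is built on Linked Data technologies); the class names, resource names and identifiers below are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class LemmaNode:
    """One canonical form acting as the hub between resources."""
    lemma_id: str                      # hypothetical identifier
    written_rep: str                   # the canonical form itself
    lexical_entries: list = field(default_factory=list)
    corpus_tokens: list = field(default_factory=list)


class KnowledgeBase:
    """Toy in-memory stand-in for a lemma-centred knowledge base."""

    def __init__(self):
        self.lemmas = {}

    def add_lemma(self, lemma_id, written_rep):
        self.lemmas[lemma_id] = LemmaNode(lemma_id, written_rep)

    def link_entry(self, lemma_id, resource, entry_id):
        # A lexical resource describes properties of a word.
        self.lemmas[lemma_id].lexical_entries.append((resource, entry_id))

    def link_token(self, lemma_id, corpus, token_id):
        # A textual resource contains occurrences of a word.
        self.lemmas[lemma_id].corpus_tokens.append((corpus, token_id))

    def everything_about(self, lemma_id):
        # Because both kinds of resource point at the same lemma,
        # one lookup reaches all of them.
        node = self.lemmas[lemma_id]
        return {
            "lemma": node.written_rep,
            "entries": list(node.lexical_entries),
            "tokens": list(node.corpus_tokens),
        }


kb = KnowledgeBase()
kb.add_lemma("lemma_1", "miror")
kb.link_entry("lemma_1", "some_lexicon", "entry_42")   # made-up entry
kb.link_token("lemma_1", "some_corpus", "token_7")     # made-up token
result = kb.everything_about("lemma_1")
```

The point of the design is that neither resource needs to know about the other: each only declares which lemma it refers to.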
A user-friendly query interface for the LiLa lemmaBank is accessible at https://lila-erc.eu/query/. Here, the relations between the contents of the LiLa collection of Latin canonical forms can be viewed graphically in a network-based fashion. For instance, Image 2 shows a subset of the relations holding between the canonical forms of the lexical items connected to the morphological Base 118 (which brings together members of the lexical family of the verb "miror" "to admire"). The hasBase relation ties the nodes of the canonical forms to the Base node. Furthermore, words formed with one or more affixes are connected to their affixes via the hasPrefix/hasSuffix relation, as is the case of the noun "admirabilitas/ammirabilitas" "admirability", which is connected to both the "-bil" and "-tas/-tat" suffixes (via hasSuffix) and the prefix "ad-" (via hasPrefix).
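The relations in this example can be written down as subject-relation-object triples. The following Python sketch uses only the items named above; the identifier `base_118` and the helper function names are assumptions made for illustration and do not reflect how LiLa names its nodes.

```python
# Triples taken from the example in the text. "base_118" is a
# hypothetical label standing in for the morphological Base 118.
TRIPLES = [
    ("miror", "hasBase", "base_118"),
    ("admirabilitas", "hasBase", "base_118"),
    ("admirabilitas", "hasPrefix", "ad-"),
    ("admirabilitas", "hasSuffix", "-bil"),
    ("admirabilitas", "hasSuffix", "-tas/-tat"),
]


def objects(subject, relation, triples=TRIPLES):
    """All nodes reached from `subject` via `relation`."""
    return sorted(o for s, r, o in triples if s == subject and r == relation)


def subjects(relation, obj, triples=TRIPLES):
    """All nodes that point at `obj` via `relation`."""
    return sorted(s for s, r, o in triples if r == relation and o == obj)


def lexical_family(lemma, triples=TRIPLES):
    """Canonical forms sharing at least one Base with `lemma`."""
    family = set()
    for base in objects(lemma, "hasBase", triples):
        family.update(subjects("hasBase", base, triples))
    return sorted(family)


family_of_miror = lexical_family("miror")
suffixes = objects("admirabilitas", "hasSuffix")
```

Here `family_of_miror` contains both "miror" and "admirabilitas", since both are tied to the same Base node, and `suffixes` lists the two suffixes attached to "admirabilitas".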
The main achievement of the first half of the LiLa project is the creation of the large collection of Latin canonical forms ('lemmaBank') serving as the backbone of the LiLa knowledge base.
Built upon state-of-the-art models, formats and technologies widely used in the Linguistic Linked Open Data world, the LiLa lemmaBank is now available to interlink distributed linguistic resources for Latin and make them interact.
Now that the core of the LiLa knowledge base has been built, the second half of the project will turn to the following main objectives.

(1) To interlink new resources and to enlarge the collection of canonical forms
Connecting new resources to LiLa will help enlarge the lemmaBank with all new lemmas found therein.
Furthermore, interlinking new resources via LiLa will expose the project to new kinds of (meta)data, thus possibly requiring the development of new models of (meta)data representation following the Linked Data principles.

(2) To build the linking workflow
The project will build a workflow to support data providers with the automatic connection of their resources to the LiLa knowledge base. The workflow will include the following services:
- automatic PoS tagging and lemmatization of unlemmatized source textual data;
- automatic linking of the lemmas of the new resource with the LiLa collection of canonical forms of Latin.
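The second service can be sketched as follows in Python, assuming the source text has already been PoS-tagged and lemmatized into (lemma, PoS) pairs. This is an illustrative outline under that assumption, not the project's workflow: the `BankLemma` type, the identifiers and the PoS labels are hypothetical. Spelling variants are indexed alongside the canonical form, and lemmas with no match are collected separately as candidates for enlarging the collection.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BankLemma:
    lemma_id: str   # hypothetical identifier
    forms: tuple    # canonical form plus spelling variants
    pos: str        # Part of Speech label (labels are an assumption)


def build_index(bank):
    """Map (form, PoS) to the ids of the bank lemmas that carry it."""
    index = {}
    for entry in bank:
        for form in entry.forms:
            index.setdefault((form.lower(), entry.pos), []).append(entry.lemma_id)
    return index


def link(items, index):
    """Split (lemma, PoS) pairs into linked, ambiguous and unlinked."""
    linked, ambiguous, unlinked = {}, {}, []
    for lemma, pos in items:
        candidates = index.get((lemma.lower(), pos), [])
        if len(candidates) == 1:
            linked[(lemma, pos)] = candidates[0]
        elif candidates:
            ambiguous[(lemma, pos)] = candidates   # needs disambiguation
        else:
            unlinked.append((lemma, pos))          # possibly a new lemma
    return linked, ambiguous, unlinked


bank = [
    BankLemma("lemma_1", ("miror",), "VERB"),
    BankLemma("lemma_2", ("admirabilitas", "ammirabilitas"), "NOUN"),
]
index = build_index(bank)
linked, ambiguous, unlinked = link(
    [("ammirabilitas", "NOUN"), ("miror", "VERB"), ("xyzzy", "NOUN")],
    index,
)
```

In this toy run the spelling variant "ammirabilitas" resolves to the same lemma as "admirabilitas", while the made-up string "xyzzy" ends up in `unlinked`.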

(3) To build the query interface of the LiLa knowledge base
In order to support user-friendly querying of the interlinked resources of the LiLa knowledge base, the project will build an interface capable of translating graphically-based queries into SPARQL queries.
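As a rough illustration of what such a translation could look like, the Python sketch below turns a graph-shaped query, given as a list of (node, relation, node) edges, into a SPARQL query string. The edge format, the function names and the `http://example.org/lila#` namespace are placeholders assumed for this example; they are not the vocabulary or the interface design of the project.

```python
import re

# Placeholder namespace, NOT the real LiLa ontology URI.
NAMESPACE = "http://example.org/lila#"
NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def term(node):
    """Render a node: '?x' stays a variable, anything else is prefixed."""
    name = node[1:] if node.startswith("?") else node
    if not NAME.match(name):
        raise ValueError(f"unsupported node name: {node!r}")
    return node if node.startswith("?") else f"ex:{node}"


def to_sparql(edges, select):
    """Build a SELECT query with one triple pattern per graph edge."""
    lines = [
        f"PREFIX ex: <{NAMESPACE}>",
        "SELECT DISTINCT " + " ".join(term(v) for v in select),
        "WHERE {",
    ]
    for subject, relation, obj in edges:
        if not NAME.match(relation):
            raise ValueError(f"unsupported relation name: {relation!r}")
        lines.append(f"  {term(subject)} ex:{relation} {term(obj)} .")
    lines.append("}")
    return "\n".join(lines)


# "Which lemmas share Base 118, and which suffixes do they have?"
query = to_sparql(
    edges=[
        ("?lemma", "hasBase", "base_118"),
        ("?lemma", "hasSuffix", "?suffix"),
    ],
    select=["?lemma", "?suffix"],
)
```

Each edge drawn in the graphical query becomes one triple pattern in the WHERE clause, and shared variable names such as `?lemma` express that two edges meet at the same node.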
Image 1: The fundamental architecture of LiLa
Image 2: A network-based representation of the LiLa collection of lemmas