Periodic Reporting for period 3 - POSTDATA (Poetry Standardization and Linked Open Data)
Reporting period: 2018-02-01 to 2019-07-31
The advantages of making poetry available online as machine-readable linked data are threefold: first, the academic community will have an accessible digital platform to work with poetic corpora and to contribute to their enrichment with its own texts; second, this way of encoding and standardizing poetic information guarantees the preservation of poems published only in manuscripts or even transmitted orally, as the texts will be digitized and stored; third, datasets and corpora will be available in open access, so the data can be used by the community for other purposes, such as education, cultural dissemination or entertainment.
On the one hand, interoperability problems between the different poetry collections will be solved by using semantic web technologies to link and publish literary datasets in a structured way in the linked data cloud. For this purpose, we are building a poetry ontology. The first step was to build a domain model for poetry. Once this was finished, we carried out an environmental scan, looking for metadata schemas related to the heterogeneous fields of the Humanities. As a result, we selected more than 50 metadata schemas and 15 controlled vocabularies that could contain concepts related to our model. After this process, we continued with the vocabulary alignment, which consists of retrieving all possible candidate terms from the metadata schemas selected during the environmental scan. This step allows us to build a semantic model in the Linked Open Data (LOD) ecosystem. Due to the complexity of the poetry domain, this ontology will be developed as a network of ontologies. The ontology will enable the exchange of existing data that could not be shared before; in this way, the project will provide a large amount of open information to expand the frontiers of knowledge and research.
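As an illustration of the linked-data principle behind this work, the sketch below represents poem metadata as (subject, predicate, object) triples in plain Python. The namespace and property names are hypothetical placeholders, not actual POSTDATA ontology terms:

```python
# Minimal sketch: poem metadata as linked-data triples.
# The namespace and property names below are hypothetical,
# not the actual POSTDATA ontology terms.

PDC = "http://example.org/postdata/ontology#"   # hypothetical namespace


def uri(local_name: str) -> str:
    """Build a full URI from a local name in the hypothetical namespace."""
    return PDC + local_name


# Each fact is a (subject, predicate, object) triple; shared URIs are
# what link records from different repertoires together.
triples = [
    (uri("poem/1"), uri("hasTitle"), "Soneto XXIII"),
    (uri("poem/1"), uri("hasAuthor"), uri("author/garcilaso")),
    (uri("poem/1"), uri("hasMetre"), uri("metre/hendecasyllable")),
    (uri("author/garcilaso"), uri("hasName"), "Garcilaso de la Vega"),
]


def objects_of(subject: str, predicate: str) -> list:
    """Return all objects for a given subject/predicate pair."""
    return [o for s, p, o in triples if s == subject and p == predicate]


# Follow a link from the poem to the author record, as a SPARQL engine would.
author = objects_of(uri("poem/1"), uri("hasAuthor"))[0]
print(objects_of(author, uri("hasName")))  # ['Garcilaso de la Vega']
```

In a real deployment these triples would live in an RDF store and be queried with SPARQL; the point here is only that shared URIs are what make records from separate repertoires interoperable.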
On the other hand, automation problems will be solved by the creation of a Poetry Lab. Thanks to this lab, researchers will be able to apply the most up-to-date language technologies and computational methods to process poetry data. Since no set of tools addressing basic poetry-analysis tasks existed before, the Poetry Lab will make life easier for researchers and users by democratizing the technology and improving the user experience.
Regarding the building of the poetry ontology, the project focused on finishing the domain model (DM). To that end, the final revision of the concepts of the domain model was carried out through the following tasks:
1. Analysis of the model of sixteen poetic repertoires.
2. Analysis of a survey addressed to the final users of poetic resources in order to understand the data needs of the users of poetry databases.
3. Analysis of the graphical user interfaces of five repertoires on the Web of Documents, to identify the information needs covered by specific poetic repertoires.
4. Analysis of six poems from different traditions to create use cases applying the data model.
5. Identification of the properties of the data model that need to be defined with a controlled vocabulary.
6. Querying of multiple databases for LOD vocabularies containing terms that could be incorporated into the ontology.
In total, twenty-three repertoires were analysed, covering sixteen different languages. http://postdata.linhd.es/partners/
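The vocabulary-alignment step described above, retrieving candidate terms from external metadata schemas for each domain-model concept, can be sketched as a simple label-similarity search. The concept labels and schema terms below are illustrative only, not taken from the actual schemas:

```python
# Sketch of the vocabulary-alignment step: for each concept label in the
# domain model, retrieve candidate terms from external metadata schemas
# by string similarity. All labels and terms here are illustrative.
from difflib import SequenceMatcher

domain_concepts = ["stanza", "rhyme scheme", "metrical line"]

# Terms harvested from external vocabularies (hypothetical examples).
schema_terms = ["Stanza", "RhymeScheme", "Line", "VerseLine", "Strophe"]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity ratio between two labels."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def candidates(concept: str, terms, threshold: float = 0.6):
    """Return schema terms whose similarity to the concept passes the threshold."""
    scored = [(t, similarity(concept, t)) for t in terms]
    return sorted([ts for ts in scored if ts[1] >= threshold],
                  key=lambda ts: ts[1], reverse=True)


for concept in domain_concepts:
    print(concept, "->", candidates(concept, schema_terms))
```

A real alignment would also weigh definitions, hierarchies and usage context, not just label strings; a human expert would then vet each candidate mapping.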
In addition, the team has begun to prepare the validation process of the DM. This process consists of providing the means for an expert not familiar with POSTDATA's DM to analyse a poetic resource in a manner compatible with our data model.
After finishing this step, we started the process of building the poetry ontology. Due to the complexity of the domain, we decided to build a network of ontologies instead, in order to make it easier to use.
Currently, three of the eight ontologies have been delivered on the project website: http://postdata.linhd.es/prototype/.
Regarding the creation of tools for the automatic annotation of poetic features, the project carried out Natural Language Processing (NLP) research, applied it to tool development, and also worked on the creation of corpora.
Tools developed by the project:
- HisMeTag (Hispanic Medieval Tagger)
- ANJA (Automatic eNJambment detection)
- SKAS (Scansion in Spanish)
All these tools have been revised and updated using the spaCy framework.
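To illustrate the kind of analysis a scansion tool such as SKAS performs, the sketch below counts metrical syllables in a Spanish verse line. It is a deliberately naive heuristic (vowel nuclei, diphthong merging, synalepha), not the project's actual algorithm:

```python
# Simplified Spanish metrical syllable counter. A naive heuristic sketch,
# not the actual SKAS algorithm: it counts vowel nuclei, merges
# diphthongs, and applies synalepha across word boundaries.
import re

VOWELS = "aeiouáéíóúü"
STRONG = "aeoáéó"   # open vowels; accented weak vowels (í, ú) also break diphthongs


def count_syllables(word: str) -> int:
    """Count syllables in a Spanish word by its vowel groups (heuristic)."""
    word = word.lower()
    total = 0
    for group in re.findall(f"[{VOWELS}]+", word):
        total += 1
        for a, b in zip(group, group[1:]):
            # hiatus: two open vowels, or an accented weak vowel, split the group
            if (a in STRONG and b in STRONG) or a in "íú" or b in "íú":
                total += 1
    return total


def metrical_length(line: str) -> int:
    """Metrical syllable count of a verse line, applying synalepha."""
    words = re.findall(r"[a-záéíóúüñ]+", line.lower())
    total = sum(count_syllables(w) for w in words)
    for w1, w2 in zip(words, words[1:]):
        w2 = w2[1:] if w2.startswith("h") else w2   # mute 'h' is transparent
        if w1 and w2 and w1[-1] in VOWELS and w2[0] in VOWELS:
            total -= 1                              # synalepha merges the boundary
    return total


print(metrical_length("Cerrar podrá mis ojos la postrera"))  # 11 (hendecasyllable)
```

A production scanner must also handle stress position at line end, syneresis, dieresis and ambiguous cases, which is precisely why dedicated tooling is needed.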
Semantic Modelling and Linked Open Data (LOD). Poetical repertoires originated in the effort of gathering data with an encyclopaedic spirit. Interoperability among poetic repertoires is not simple, as there are not only technical issues involved, but also conceptual and terminological problems: each repertoire belongs to its own poetical tradition, and each tradition has developed its own idiosyncratic analytical terminology independently over many years. As no previous model of such a poetic conceptualization existed, the common data model created will be one of the main and most innovative contributions of the project. Regarding progress beyond the state of the art in this respect, more poetic resources than initially stated in the project proposal were found and analysed. This made the modelling process more difficult, but as a result POSTDATA's model became more comprehensive and, above all, more rigorous.
The Poetry Lab will include different levels of poetry scholarship, from the most formal processes to the most cognitive and subjective ones, involving Artificial Intelligence techniques. In terms of Natural Language Processing, on the one hand, the project worked on geolocation (with the support of the Pelagios project), implementing a tool for Spanish medieval texts that improves on the current state of the art. On the other hand, the project has made considerable progress in the automatic detection of enjambment, a prosodic phenomenon that is difficult to analyse automatically.
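A minimal illustration of the enjambment problem: a verse line not closed by punctuation often runs on syntactically into the next one. The punctuation heuristic below is a deliberately naive baseline and is not the method implemented in ANJA, which relies on syntactic analysis:

```python
# Naive enjambment detector: flag lines whose syntactic unit is not
# closed by punctuation, so they may run on into the next line.
# A surface baseline only, not the method implemented in ANJA.

def enjambment_candidates(poem: str) -> list:
    """Return 0-based indices of lines that may be enjambed."""
    lines = [ln.strip() for ln in poem.strip().splitlines() if ln.strip()]
    flagged = []
    for i, line in enumerate(lines[:-1]):        # the last line cannot run on
        if not line.endswith((".", ",", ";", ":", "!", "?")):
            flagged.append(i)
    return flagged


poem = """Yo voy soñando caminos
de la tarde. ¡Las colinas
doradas, los verdes pinos,
las polvorientas encinas!"""

print(enjambment_candidates(poem))  # [0, 1]
```

In this Machado stanza the heuristic flags line 1 ("¡Las colinas / doradas"), a genuine enjambment splitting noun and adjective, but it also flags line 0, where the sentence does continue yet no strong prosodic tension is felt; distinguishing such cases is what makes the task hard and motivates syntax-aware tools.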