
Poetry Standardization and Linked Open Data

Periodic Reporting for period 5 - POSTDATA (Poetry Standardization and Linked Open Data)

Reporting period: 2021-10-01 to 2022-04-30

POSTDATA (Poetry Standardization and Linked Open Data) aims to bridge the digital gap between traditional cultural assets and the growing world of data. It focuses on the analysis, classification and publication of poetry, applying Digital Humanities methods. The goal is to pursue both standardization and innovation by using semantic web technologies to link and publish literary datasets in a structured way in the linked data cloud.

The advantages of making poetry available online as machine-readable linked data are threefold. First, the academic community will have an accessible digital platform to work with poetic corpora and to enrich them with their own texts. Second, encoding and standardizing poetic information in this way will guarantee the preservation of poems published only in manuscripts or even transmitted orally, as the texts will be digitized and stored. Third, datasets and corpora will be available in open access, so the community can use the data for other purposes, such as education, cultural dissemination or entertainment.

During the project, we accomplished all the goals set at its outset:
1. We finished the PoetryLab tools, a set of tools for analyzing Spanish poetry and for creating poetry corpora (i.e. Rantanplan, Hismetag and Averell, detailed below).
2. We finished the description of the poetry ontology, OntoPoetry. We also described some examples to illustrate its expressive capabilities.
3. We produced the Postdata linked data knowledge graph from the corpora available in the Averell tool. We enriched it with further information about authors using the linked data paradigm.
4. We developed the user interface that makes the Postdata knowledge graph, based on OntoPoetry, accessible.
5. We defined the final architecture of Postdata based on Docker containers, so it can be deployed on any server.

The PoetryLab tools and the Poetry Ontology are the most remarkable achievements of the Postdata project. Multiple activities have been carried out to achieve the project outcomes:
1. Analysis of the model of poetic repertories.
2. Analysis of a survey addressed to the final users of poetic resources in order to understand the data needs of the users of poetry databases.
3. Analysis of the graphical user interface on the Web of Documents of repertoires to retrieve the informational needs of specific poetic repertoires.
4. Analysis of poems from different traditions to create use cases applying the data model.
5. Identification of the properties of the data model that need to be defined with a controlled vocabulary.
6. Querying of multiple databases for LOD vocabularies containing terms that could be incorporated into the ontology.
7. Development of new approaches to automated scansion, stanza detection and enjambment detection.
8. Training of a model on a newly developed corpus (PULPO). The resulting model, Alberti, was evaluated on the MLM metric for English and Spanish.

As a result, we have developed three PoetryLab tools:
- Rantanplan, for automated scansion, built on top of the industrial-strength NLP framework spaCy for speed.
- Stanza detection, a classical solution that extracts the information needed and then applies a knowledge base, curated by experts, containing the rules that identify the different stanza types.
- Postdata Jollyjumper, a new tool that replaced ANJA and annotates enjambment and its type based on previous typologies. There are three broad categories (below), each with some subcategories:
1. Lexical enjambment.
2. Phrase-bounded enjambment.
3. Cross-clause enjambment.
We have obtained three intellectual property assets based on this work: Averell, Rantanplan and the PoetryLab API.
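
The rule-based approach to stanza detection described above can be illustrated with a minimal sketch. It assumes that the syllable count of each line and the rhyme scheme have already been extracted. The `StanzaRule` class, the `detect_stanza` function and the four rules are hypothetical and chosen only for illustration; they are not the project's actual knowledge base or API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class StanzaRule:
    """One hand-written rule: a stanza name plus the pattern it must match."""
    name: str
    syllables: Tuple[int, ...]  # expected metrical length of each line
    rhyme: str                  # expected rhyme scheme, one letter per line


# Illustrative rules only (hypothetical); a curated knowledge base
# would hold many more rules and looser matching conditions.
RULES = (
    StanzaRule("redondilla", (8, 8, 8, 8), "abba"),
    StanzaRule("cuarteta", (8, 8, 8, 8), "abab"),
    StanzaRule("cuarteto", (11, 11, 11, 11), "abba"),
    StanzaRule("serventesio", (11, 11, 11, 11), "abab"),
)


def detect_stanza(syllables: Sequence[int], rhyme: str) -> Optional[str]:
    """Return the name of the first rule matching the extracted features."""
    for rule in RULES:
        if tuple(syllables) == rule.syllables and rhyme.lower() == rule.rhyme:
            return rule.name
    return None  # no rule in the knowledge base applies
```

Keeping the rules as data rather than as code is one way to let domain experts curate them without touching the matching logic.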

The Postdata Poetry Ontology provides a set of concepts for describing poetic works (poems, poetic drama or plays written in verse, and songs). It is the product of a homogenization effort that considers different literary traditions, periods, poetic genres, and authorship. It will also make it possible to compare the characteristics and data of these poetic works and thus to carry out quantitative research in Comparative Literature and Comparative Metrical Studies. Two potential use cases were used to define it:
- Bibliographic information and sources search and indexing:
  - The OntoPoetry Core module represents the abstract or conceptual side of the bibliographic information.
  - The OntoPoetry Transmission module represents the more tangible side of bibliographic information related to poetic works.
- Poetic information annotation and searching:
  - The OntoPoetry Poetic Analysis module represents different phenomena associated with metrics and prosody, including the textual elements or parts of a poem and the different metrical patterns used to analyze those elements.
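
As a rough sketch of how the three modules divide the description of a single poem, the example below builds subject-predicate-object statements in plain Python. The prefixes (`core:`, `transmission:`, `analysis:`, `ex:`) and the property names are assumptions made for illustration; they are not the actual OntoPoetry vocabulary.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """A subject-predicate-object statement, as in linked data."""
    subject: str
    predicate: str
    obj: str


# Hypothetical prefixes standing in for the three OntoPoetry modules.
CORE = "core:"
TRANSMISSION = "transmission:"
ANALYSIS = "analysis:"


def describe_poem(poem_id: str, title: str, source: str,
                  line_count: int) -> List[Triple]:
    """Describe one poem with one statement per module (property names invented)."""
    poem = f"ex:{poem_id}"
    return [
        Triple(poem, CORE + "title", title),                    # conceptual side
        Triple(poem, TRANSMISSION + "attestedIn", source),      # tangible side
        Triple(poem, ANALYSIS + "lineCount", str(line_count)),  # metrics and prosody
    ]


def by_module(triples: List[Triple], prefix: str) -> List[Triple]:
    """Select the statements whose predicate belongs to one module."""
    return [t for t in triples if t.predicate.startswith(prefix)]
```

For example, `by_module(describe_poem("poem1", "Soneto", "ms-1", 14), ANALYSIS)` returns only the statement about the line count, which mirrors the second use case (poetic information annotation and searching).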

Finally, the Postdata project has worked actively on communication and dissemination throughout its duration. Project members have disseminated project results through activities such as: >15 publications in scientific journals, 2 book chapters, >60 contributions in conference proceedings, organization of 14 workshops, participation in more than 15 workshops, etc. Furthermore, the Postdata community has been closely involved in communication activities aimed at the general public, such as radio interviews, publications on the project website, and informative articles in general-interest magazines.

As mentioned, POSTDATA will materialize in a digital semantic web-based platform for poetry analysis and editing, designed to study, publish and share digital collections in a virtual research environment that combines digital humanities open standards with traditional philological academic analysis. The environment will be open to any language and type of poetry and accessible to multiple users with different profiles, and it will provide access to digital resources on poetry linked together through data repositories. POSTDATA is based on three pillars.

Semantic Modelling and Linked Open Data (LOD). The effort of gathering data with an encyclopaedic spirit was the origin of poetical repertoires. Interoperability among poetic repertoires is not simple, as it involves not only technical issues but also conceptual and terminological problems: each repertoire belongs to its own poetical tradition, and each tradition has developed its own idiosyncratic analytical terminology independently over the years. As no model of such a poetic conceptualization existed before, the common data model created will be one of the main and most innovative contributions of the project. Regarding progress beyond the state of the art in this area, more poetic resources than those initially stated in the project proposal were found and analysed. This made the modelling process more difficult, but as a result POSTDATA's model became more comprehensive and, especially, more rigorous.

Overall, all the achievements obtained around the PoetryLab tools are completely new for Spanish and go beyond the state of the art, especially those outcomes related to using deep learning in the poetry field. Furthermore, the Postdata Poetry Ontology is the first ontology of its kind in the world.
Figure 0: POSTDATA Logo
Figure 2: Excerpt of the Domain Model for European Poetry
Figure 1: The process of development of the Domain Model