Increasing the uptake of language technologies

An EU-funded project has set out to make linguistic linked open data ready to use to help ensure that no citizens are left out of the digital revolution.

Digital Economy

Society

Language technologies play an important role in breaking down language barriers, promoting multiculturalism and making Europe’s digital decade accessible for all. These technologies rely on large amounts of data, and with better access to and use of language resources, they can also provide multilingual solutions that will support the emerging Digital Single Market in Europe. However, language technology specialists spend around 80 % of their time cleaning, organising and collecting data sets because data is not ‘ready to use’. The extract-transform-load process, which involves linking data sets to existing designs, has the potential to reduce this effort. However, the technology remains unexploited.

This is where the EU-funded Pret-a-LLOD project comes in. “We aimed to combine linked data technologies with natural language processing (NLP) techniques to increase the availability of language technologies for individuals and enterprises in Europe,” explains project coordinator John McCrae. Linked data technologies make data easier to share and manage on the web, which in turn makes it more available and accessible. “In this way, the project is similar to the goals of the FAIR initiative to increase the usefulness of data,” notes McCrae.

Delivering a data value chain and key open-source components

The project developed a data value chain that covers all aspects of a data set’s life cycle: discovery, transformation, management (especially of licenses), linking and application in NLP workflows. The team also delivered five key open-source components that support this data value chain. “Firstly, the LingHub2 portal allows language resources to be discovered using linked data principles and query methods and aggregates data from a wide variety of sources. Secondly, we have developed Fintan, a novel and flexible engine for the transformation of data from a wide variety of formats into linked data,” highlights McCrae. The team also developed tools for policy-driven data management that use the Open Digital Rights Language to predict how open-source licenses can be combined. Furthermore, the team built several tools for linking data sets at different levels, including lexicalisation of existing resources, conceptual-level linking and lexical linking, so that data sets can be connected and integrated more easily. “We have also developed Teanga, a workflow management tool that allows different components and data sets to be used in workflows defined with technologies such as Docker and OpenAPI,” adds McCrae.
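To give a sense of what transforming a language resource into linked data can mean in practice, the following is a minimal sketch and not Fintan’s actual interface. It assumes a hypothetical three-column word list and a placeholder namespace (`http://example.org/lexicon/`), and it writes each row as N-Triples using the W3C OntoLex-Lemon vocabulary. The function names `row_to_ntriples` and `escape_literal` are invented for illustration.

```python
from typing import List

# Illustrative sketch only: this is NOT Fintan's API.
# BASE is an assumed placeholder namespace; the rows are made-up sample data.
ONTOLEX = "http://www.w3.org/ns/lemon/ontolex#"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
BASE = "http://example.org/lexicon/"


def escape_literal(text: str) -> str:
    """Escape backslashes and double quotes for an N-Triples literal."""
    return text.replace("\\", "\\\\").replace('"', '\\"')


def row_to_ntriples(entry_id: str, lemma: str, lang: str) -> List[str]:
    """Turn one (id, lemma, language tag) row into three N-Triples lines."""
    entry = f"<{BASE}{entry_id}>"
    form = f"<{BASE}{entry_id}#form>"
    return [
        # The entry is a lexical entry ...
        f"{entry} <{RDF_TYPE}> <{ONTOLEX}LexicalEntry> .",
        # ... whose canonical form is a separate resource ...
        f"{entry} <{ONTOLEX}canonicalForm> {form} .",
        # ... carrying the written representation with a language tag.
        f'{form} <{ONTOLEX}writtenRep> "{escape_literal(lemma)}"@{lang} .',
    ]


if __name__ == "__main__":
    rows = [("teanga", "teanga", "ga"), ("language", "language", "en")]
    for entry_id, lemma, lang in rows:
        print("\n".join(row_to_ntriples(entry_id, lemma, lang)))
```

Because every entry and form gets its own web identifier, other data sets can point to the same entries, which is what makes the later linking step possible.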

Paving the way to flexible NLP pipelines

These tools have been validated in demos with the project’s commercial partners. These include a novel chatbot system developed by Derilinx, extensions to Semantic Web Company’s popular PoolParty tool for terminology management, novel methodologies for cross-lingual NLP at Semalytix, and improvements to the processes used to develop the dictionaries at Oxford University Press, including the Oxford English Dictionary. “We hope that this project will ensure that more data is available, allowing NLP pipelines to be more flexible and quickly applied,” concludes McCrae. A particular goal of the project is to apply NLP techniques to minoritised languages in Europe, for which sufficient resources are not yet available. The data management and NLP tools developed in the project can help improve the situation for these languages.

Project Information

Pret-a-LLOD

Grant agreement ID: 825182

DOI

10.3030/825182

Project closed

EC signature date 14 November 2018

Start date 1 January 2019

End date 30 June 2022

Funded under

INDUSTRIAL LEADERSHIP - Leadership in enabling and industrial technologies - Information and Communication Technologies (ICT)

Total cost

€ 2 997 181,25

EU contribution

€ 2 997 181,25

Coordinated by

UNIVERSITY OF GALWAY
Ireland
