Increasing the uptake of language technologies
Language technologies play an important role in breaking down language barriers, promoting multiculturalism and making Europe's digital decade accessible to all. These technologies rely on large amounts of data, and with better access to and use of language resources, they can also provide multilingual solutions that will support the emerging Digital Single Market in Europe. However, language technology specialists spend around 80 % of their time collecting, cleaning and organising data sets because the data is not 'ready to use'. The extract-transform-load (ETL) process, which involves linking data sets to existing data models, has the potential to reduce this effort. However, this technology remains largely unexploited.

This is where the EU-funded Pret-a-LLOD project comes in. "We aimed to combine linked data technologies with natural language processing (NLP) techniques to increase the availability of language technologies for individuals and enterprises in Europe," explains project coordinator John McCrae. The use of linked data technologies allows data to be more easily shared and managed on the web, and thus increases the availability and accessibility of data. "In this way, the project is similar to the goals of the FAIR initiative to increase the usefulness of data," notes McCrae.
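To make the linked data idea concrete, here is a minimal sketch (illustrative only, not project code) of how a single lexical entry could be published as linked data in Python, using the rdflib library and the W3C OntoLex-Lemon vocabulary commonly used for lexical resources; the example.org URIs are invented for the example.

```python
from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

# W3C OntoLex-Lemon vocabulary for modelling lexical resources
ONTOLEX = Namespace("http://www.w3.org/ns/lemon/ontolex#")
# Invented namespace for this illustration
EX = Namespace("http://example.org/lexicon/")

g = Graph()
g.bind("ontolex", ONTOLEX)

entry = EX["cat"]
form = EX["cat-form"]

# Describe the lexical entry and its canonical written form
g.add((entry, RDF.type, ONTOLEX.LexicalEntry))
g.add((entry, ONTOLEX.canonicalForm, form))
g.add((form, RDF.type, ONTOLEX.Form))
g.add((form, ONTOLEX.writtenRep, Literal("cat", lang="en")))

# Pointing at an external resource is what makes the data 'linked':
# any data set that references the same URI is now connected to ours
g.add((entry, ONTOLEX.denotes, URIRef("http://dbpedia.org/resource/Cat")))

print(g.serialize(format="turtle"))
```

Because every element has a global URI, an entry like this can be queried, merged and reused across data sets without bespoke conversion code, which is precisely what makes data 'ready to use'.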
Delivering a data value chain and key open-source components
The project developed a data value chain that covers all aspects of a data set's life cycle, in particular discovery, transformation, management (especially of licences), linking and application in NLP workflows. The team also delivered five key open-source components that support this data value chain. "Firstly, the LingHub2 portal allows language resources to be discovered using linked data principles and query methods, and aggregates data from a wide variety of sources. Secondly, we have developed Fintan, a novel and flexible engine for the transformation of data from a wide variety of formats into linked data," highlights McCrae. Tools for policy-driven data management were also developed; based on the Open Digital Rights Language (ODRL), they predict which combinations of open-source licences are possible. Furthermore, tools for linking data sets at multiple levels, including the lexicalisation of existing resources, conceptual-level linking and lexical linking, were built, enabling data sets to be connected and integrated more easily. "We have also developed Teanga, a workflow management tool that allows different components and data sets to be used in workflows defined with technologies such as Docker and OpenAPI," adds McCrae.
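The licence-management idea can be illustrated with a toy model (a simplified sketch, not the project's actual tool): each licence is reduced to the sets of ODRL action terms it permits or prohibits, and an action survives a combination of data sets only if every licence permits it and none prohibits it. The licence summaries below are deliberately simplified assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Licence:
    """A licence reduced to sets of ODRL-style action terms."""
    name: str
    permissions: set = field(default_factory=set)
    prohibitions: set = field(default_factory=set)

def combine(*licences: Licence) -> Licence:
    """Predict the policy governing a data set derived from all inputs:
    an action is permitted only if every input licence permits it,
    and any single prohibition carries over to the combination."""
    permitted = set.intersection(*(l.permissions for l in licences))
    prohibited = set.union(*(l.prohibitions for l in licences))
    return Licence("combined", permitted - prohibited, prohibited)

# Deliberately simplified summaries of two well-known licences
cc_by = Licence("CC-BY-4.0",
                permissions={"odrl:reproduce", "odrl:distribute",
                             "odrl:derive", "odrl:sell"})
cc_by_nc = Licence("CC-BY-NC-4.0",
                   permissions={"odrl:reproduce", "odrl:distribute",
                                "odrl:derive"},
                   prohibitions={"odrl:sell"})

combined = combine(cc_by, cc_by_nc)
print(sorted(combined.permissions))  # commercial resale drops out of the combination
```

A real ODRL policy also carries duties (such as attribution) and constraints, so an actual implementation needs far more than set intersection; the sketch only shows why combining licences is a prediction problem that can be automated.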
Paving the way to flexible NLP pipelines
These tools have been validated in demos with the project's commercial partners, including a novel chatbot system developed by Derilinx, extensions to Semantic Web Company's popular PoolParty tool for terminology management, novel methodologies for cross-lingual NLP at Semalytix, and improvements to the processes used to develop the dictionaries at Oxford University Press, including the Oxford English Dictionary. "We hope that this project will ensure that more data is available, allowing NLP pipelines to be more flexible and quickly applied," concludes McCrae. A particular goal of the project is the application of NLP techniques to minoritised languages in Europe, for which resources are not sufficiently available; the data management and NLP tools developed in this project can improve the situation for these languages.
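As a rough illustration of the kind of pipeline flexibility McCrae describes, the sketch below chains two containerised NLP services over HTTP, in the spirit of Teanga's Docker/OpenAPI approach; the service names, ports and endpoint paths here are invented, and the actual Teanga interface may differ.

```python
import requests

# Hypothetical endpoints of two Dockerised NLP services, each of which
# would describe its interface with an OpenAPI specification
TOKENISER_URL = "http://localhost:8001/tokenise"
TAGGER_URL = "http://localhost:8002/tag"

def run_pipeline(text: str) -> dict:
    """Pass a document through tokenisation and then part-of-speech tagging.

    Because each step is an independent HTTP service, steps can be swapped,
    re-ordered or applied to a different language's data without touching
    the other components.
    """
    tokens = requests.post(TOKENISER_URL, json={"text": text}, timeout=10).json()
    return requests.post(TAGGER_URL, json=tokens, timeout=10).json()

if __name__ == "__main__":
    print(run_pipeline("Pret-a-LLOD makes language data easier to reuse."))
```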
Keywords
Pret-a-LLOD, NLP, language technologies, data value chain, linked data technologies, open-source components, natural language processing, Digital Single Market