Common Language Resources and their Applications

Final Report Summary - CLARA (Common Language Resources and their Applications)

The scientific objectives of the CLARA project have been twofold: (1) to develop the next generation of data-intensive language models and applications by integrating approaches across language and country boundaries; (2) to contribute to the establishment of a pan-European infrastructure for language resources.

In this emerging research context, there was a demonstrated shortage of fully qualified candidates with knowledge of the state of the art. There was also a shortage of funded PhD-level positions in which researchers could gain experience with state-of-the-art methods. Access to advanced courses and supervision was not evenly distributed among institutions and countries, and there were insufficient incentives for mobility and cooperation across national borders in relevant researcher training.

Therefore, the training and career development of CLARA researchers has aimed to produce the next generation of linguistic scholars and engineers with suitable interdisciplinary competencies. CLARA has contributed 20 newly trained researchers to the approximately 1 million new researchers who are needed in Europe. They were trained locally on the job, through secondments, and in a joint training programme, aided by three Visiting Researchers.

The project has achieved its main goals. Its main scientific achievements are summarized below by subproject.

In the subproject "Designing and Testing Common Infrastructures" at the Max Planck Institute for Psycholinguistics, the Language Archive and its software components have been extended. Machine learning systems have been developed to speed up the annotation of multimedia, and search in video dialogs has been improved. Knowledge discovery in linguistic recordings has taken a step forward, and a media query language has been developed.

In the subproject "Lexical Semantic Modeling", work has focused on the quality and efficient construction of lexical and conceptual resources. At the University of Bergen, a German/Spanish corpus of technical texts has been aligned and is being explored to improve statistical machine translation. At the University of Copenhagen, corpora were annotated with respect to regular polysemy in English, Spanish and Danish. At the University Pompeu Fabra, a novel method for the classification of semantic relations was developed.

The subproject "Next Generation Domain Modeling" has worked on the construction, harmonization and management of multilingual terminological resources. At the Norwegian School of Economics, a specialized corpus of English and Spanish FTA texts was aligned and studied with respect to specialized collocations in legal and economics domains. Tilde has worked on automated methods for context extraction and enrichment of a multilingual terminology database with knowledge-rich contexts.

In the subproject "Multimedia and Multimodal Communication Modeling", the Max Planck Institute and the University of Copenhagen have cooperated on innovative methods and emerging standards for audio and video annotation. Audio and video processing algorithms for automated annotation of linguistic recordings and their interfaces have been developed. The semantic relations between speech and iconic hand gestures in narrative and conversational multimodal corpora were studied and formalized.

In the subproject "Applications", the Latvian language technology company Tilde and the Charles University in Prague have cooperated on methods that facilitate the development of missing translation tools and relevant resources necessary for under-resourced languages including Indonesian and Tamil, two languages for which new dependency treebanks were developed. A second application area, Computer Assisted Language Learning, has been addressed at the University of Tübingen. This research has resulted in the creation of a comprehensive corpus of graded texts for training readability models and a rich set of linguistic features for building a robust readability model.
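As a rough illustration of what linguistic features for a readability model can look like, the following minimal sketch computes three simple surface features from a text. The function name `surface_features` and the choice of features are assumptions made for illustration only; they are not the feature set developed in the project, which is described as considerably richer.

```python
import re


def surface_features(text):
    """Compute a few simple surface features often used as readability cues.

    Hypothetical helper for illustration; not the CLARA feature set.
    """
    # Split on sentence-final punctuation and drop empty fragments.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    # Lowercased alphabetic tokens (apostrophes kept inside words).
    words = re.findall(r"[a-z']+", text.lower())
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    return {
        "avg_sentence_length": len(words) / len(sentences),
        "avg_word_length": sum(len(w) for w in words) / len(words),
        "type_token_ratio": len(set(words)) / len(words),
    }


if __name__ == "__main__":
    print(surface_features("The cat sat. The cat ran fast."))
```

In a setting like the one described, feature vectors of this kind would be computed for each graded text and passed to a classifier or regressor trained on the grade labels.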

"Parsing Technologies and Grammar Models" has been a subproject aimed at novel approaches to parsing, ranging from shallow to deep parsing models. At the University of Helsinki, new methods have been developed for efficiently implementing constraint grammar rules with finite-state technologies, learning phonological replacement rules from parallel corpora, and converting context-free grammars and probabilistic context-free grammars into parallel finite-state machines (FSMs) for parsing. The research group at the University of Helsinki has released the open source toolset HFST. They have also developed methods for hyperminimization of morphological lexicons, providing a way to reduce lexicons and grammars beyond their minimal FSM size. The Greenlandic lexicon can be hyperminimized with a size reduction of approximately 90%. The University of Bergen has implemented a morphological component and a deep LFG grammar for Wolof. The Charles University in Prague has studied parsing technologies and has shown that noun phrase bracketing is helpful for machine translation. The University of Tübingen has focused on creating a natural language search facility for querying large metadata repositories and has built a question answering system that is able to convert a natural language question to its Linked Data representation. They have also developed computational models of Chinese word formation and algorithmic approaches to Chinese word structure annotation. Word-based models for Chinese word segmentation have been generalized to a phrase-based model.
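To make the notion of "minimal FSM size" concrete, the sketch below stores a small word list as a trie and then counts how many states remain when equivalent states are merged, which is ordinary minimization of an acyclic automaton. It does not implement hyperminimization and does not use HFST; it only shows the baseline that hyperminimization is said to go beyond. All function names and the toy word list are assumptions made for illustration.

```python
def build_trie(words):
    """Build a trie; each state is a dict with a 'final' flag and 'edges'."""
    root = {"final": False, "edges": {}}
    for word in words:
        node = root
        for ch in word:
            node = node["edges"].setdefault(ch, {"final": False, "edges": {}})
        node["final"] = True
    return root


def count_trie_states(node):
    """Number of states in the unminimized trie."""
    return 1 + sum(count_trie_states(c) for c in node["edges"].values())


def count_minimal_states(root):
    """Number of states after merging states that accept the same suffixes.

    Two trie states are equivalent if they agree on finality and have the
    same labelled transitions to equivalent states, so a bottom-up
    signature identifies each equivalence class.
    """
    registry = {}

    def class_id(node):
        signature = (
            node["final"],
            tuple(sorted((ch, class_id(child))
                         for ch, child in node["edges"].items())),
        )
        return registry.setdefault(signature, len(registry))

    class_id(root)
    return len(registry)


if __name__ == "__main__":
    lexicon = ["walked", "talked", "walking", "talking"]
    trie = build_trie(lexicon)
    print(count_trie_states(trie), count_minimal_states(trie))
```

For this toy lexicon the trie has 19 states and the merged automaton has 9, because the shared stems and endings collapse into common states. Hyperminimization, as described above, accepts further reductions below this minimal size.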

CLARA has taken an interdisciplinary approach. With its basis in language sources, the project has been primarily situated in linguistics and the humanities, which are increasingly going digital. However, the project has also incorporated methods from information science, statistics, computer science, cognitive science, image analysis, machine learning and artificial intelligence, to name just a few. Thus, the next generation of language researchers has been trained in a new combination of training components which most universities and research institutions cannot offer by themselves.

The project has implemented ten planned international thematic training events, open also to external researchers, in its Joint Training Programme, and has contributed to an additional event (CHAT workshop). Training has also included secondments to participating institutions in other countries, and has led to joint degrees.

CLARA has societal relevance by promoting communication using natural language in the digital age. Its results will allow multilingual technologies to be put into practice in the form of new ICT solutions and services that bring people together and play an enabling role in public, personal and business communication.

The project website with project results and contact details is http://clara.b.uib.no