
Language Technologies



ACCURAT - Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation

248347 - STREP


At a glance

ICT-2007.2.2 - Cognitive Systems, Interaction, Robotics

Almost half of the citizens of the European Union do not have a good command of any language other than their native one. There is an urgent need for advanced language technologies and tools to facilitate international and multilingual interaction.

Challenge

In recent decades, data-driven approaches have significantly advanced the development of machine translation (MT). However, the applicability of current data-driven methods depends directly on the availability of very large quantities of parallel corpus data. For this reason, the translation quality of current data-driven MT systems varies dramatically, from quite good for language pairs with large corpora available (e.g. English and French) to almost unusable for under-resourced languages and domains (e.g. Latvian and Croatian).

Goal

The ACCURAT project addresses the widely recognized bottleneck of insufficient parallel corpora for data-driven MT systems. The goal of the project is to research methods and techniques for exploiting comparable corpora to overcome the lack of linguistic resources in under-resourced areas of machine translation. The ultimate aim is to achieve a significant increase in translation quality for under-resourced languages and narrow domains.

Scientific Innovation

The project will develop methods and tools to measure, find and use comparable corpora to improve the quality of statistical and rule-based MT systems for under-resourced languages and domains.

The result

The ACCURAT project will provide researchers and developers with a fully functional model, methodology and tools for exploiting comparable corpora in MT:

  • criteria of corpora comparability and methods of measuring it (criteria and metrics of comparability and parallelism)
  • methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora (toolkit for multi-level alignment and information extraction from comparable corpora)
  • tools for building comparable corpora from the Web
  • collection of comparable corpora covering the under-resourced project languages and narrow domains
  • methodology for application of data extracted from comparable corpora in statistical and rule-based machine translation.

The ACCURAT methodology and tools will be demonstrated by application scenarios developed in the project: translation solutions for professional translators in localization services, MT for web authoring – blog writing, and MT in narrow domains.

Impact

The ACCURAT project will provide methods for the automatic acquisition and annotation of language resources, closing gaps in language coverage, increasing translation quality and making automated translation more adaptive. ACCURAT will have a positive impact on the further European integration of research and the ICT industry in countries which have recently acceded to the EU (Latvia, Lithuania, Estonia, Romania, Slovenia) and in the candidate country Croatia.

The technological advances brought about by the ACCURAT project will advance the theory and practice of MT, corpus linguistics, information extraction and natural language processing as a whole.

Timeframe

The project will start on 1st January 2010 and run until 30th June 2012.

The consortium consists of: Tilde SIA, University of Sheffield, University of Leeds, Athena Research and Innovation Center in Information Communication & Knowledge Technologies, University of Zagreb, DFKI, Institutul de Cercetari Pentru Inteligentia Artificiala, Linguatec GmbH and Zemanta d.o.o.

Where will the project be present?

ACCURAT is co-organising the 5th BUCC workshop at LREC2012

The joint proposal for the 5th BUCC workshop, which will be held as an LREC2012 workshop in Istanbul on 2012-05-26, has been accepted. ACCURAT has been selected as an organiser together with a number of important EU-funded projects dealing with comparable and parallel corpora. The special topic of the workshop is . More information, including the Call for Papers, can be found at the workshop web site.

 

Co-ordinator

Contact Person:

Name: Aivars Berzins
Tel: +371 67605001
Fax: +371 67605750
E-mail
Organisation: Tilde SIA

 




























This page is maintained by: Susan Fraser