The Community Research and Development Information Service - CORDIS
Information & Communication Technologies

Language Technologies



Please note that the project factsheets will no longer be updated. All information relevant to the project can be found on the CORDIS factsheet, which is updated on a regular basis with public deliverables and other project materials.


ACCURAT - Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation

248347 - STREP


At a glance

ICT-2007.2.2 - Cognitive Systems, Interaction, Robotics

Almost half of the citizens of the European Union do not have a good command of any language other than their native language. There is an urgent need for advanced language technologies and tools to facilitate international and multilingual interaction.


In recent decades, data-driven approaches have significantly advanced the development of machine translation (MT). However, the applicability of current data-driven methods depends directly on the availability of very large quantities of parallel corpus data. For this reason, the translation quality of current data-driven MT systems varies dramatically, from quite good for language pairs with large corpora available (e.g. English and French) to almost unusable for under-resourced languages and domains (e.g. Latvian and Croatian).


The ACCURAT project addresses the widely recognized bottleneck of insufficient parallel corpora for data-driven MT systems. The goal of the project is to research methods and techniques for exploiting comparable corpora to overcome the lack of linguistic resources in under-resourced areas of machine translation. The ultimate aim of the project is to achieve a significant increase in translation quality for under-resourced languages and narrow domains.

Scientific Innovation

The project will develop methods and tools to measure, find and use comparable corpora to improve the quality of statistical and rule-based MT systems for under-resourced languages and domains.

The result

The ACCURAT project will provide researchers and developers with a fully functional model, methodology and tools for exploiting comparable corpora in MT:

  • criteria of corpora comparability and methods of measuring it (criteria and metrics of comparability and parallelism)
  • methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora (toolkit for multi-level alignment and information extraction from comparable corpora)
  • tools for building comparable corpora from the Web
  • collection of comparable corpora covering under-resourced languages and narrow domains in the project languages
  • methodology for application of data extracted from comparable corpora in statistical and rule-based machine translation.
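
To make the idea of a comparability metric concrete, the following is a minimal illustrative sketch and not the project's own method. It assumes a simple approach: source-language words are mapped into the target language through a bilingual dictionary, and the two documents are then compared by the cosine similarity of their word-count vectors. The function names and the toy German-English dictionary are hypothetical and exist only for this example.

```python
from collections import Counter
from math import sqrt


def translate_tokens(tokens, dictionary):
    """Map source tokens into the target language, dropping unknown words."""
    return [dictionary[t] for t in tokens if t in dictionary]


def cosine_similarity(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as not comparable
    return dot / (norm_a * norm_b)


def comparability(source_text, target_text, dictionary):
    """Score in [0, 1]; higher means more shared (translated) vocabulary."""
    src = Counter(translate_tokens(source_text.lower().split(), dictionary))
    tgt = Counter(target_text.lower().split())
    return cosine_similarity(src, tgt)


# Toy dictionary (hypothetical, for illustration only).
TOY_DICTIONARY = {"hund": "dog", "katze": "cat", "haus": "house"}

if __name__ == "__main__":
    print(comparability("Hund Katze", "dog cat", TOY_DICTIONARY))
    print(comparability("Hund", "house", TOY_DICTIONARY))
```

A score of this kind could be used to rank candidate document pairs before attempting alignment or extraction; a practical metric would also need proper tokenization, lemmatization and handling of words missing from the dictionary.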

The ACCURAT methodology and tools will be demonstrated by application scenarios developed in the project: translation solutions for professional translators in localization services, MT for web authoring – blog writing, and MT in narrow domains.


The ACCURAT project will provide methods for automatic acquisition and annotation of language resources, removing gaps in language coverage and increasing the quality of translation. It will also provide methods for making automated translation more adaptive. ACCURAT will have a positive impact on further European integration of research and the ICT industry from countries which have recently acceded to the EU (Latvia, Lithuania, Estonia, Romania, Slovenia) and the candidate country Croatia.

The technological advances brought about by the ACCURAT project will advance the overall theory and practice of MT, corpus linguistics, information extraction and natural language processing as a whole.


The project will start on 1st January 2010 and run until 30th June 2012.

The consortium consists of: Tilde SIA, University of Sheffield, University of Leeds, Athena Research and Innovation Center in Information Communication & Knowledge Technologies, University of Zagreb, DFKI, Institutul de Cercetari Pentru Inteligentia Artificiala, Linguatec GmbH and Zemanta d.o.o.

Where will the project be present?

At least two international workshops will be organized as satellite events to major conferences. The objective of the workshops is to present the advances accomplished within the project and to use these to update the scientific state of the art of this domain.

ACCURAT will introduce the advances made by the project at LREC 2010:

  • the workshop "Methods for the automatic acquisition of Language Resources and their evaluation methods" on Sunday, 23rd May 2010
  • the "3rd Workshop on Building and Using Comparable Corpora", where two papers about the ACCURAT project have been accepted.

Beyond scientific circles, presentations at LISA (Localization Industry Standards Association) and Localization World conferences are planned.



Contact Person:

Name: Aivars Berzins

Tel: +371 67605001

Fax: +371 67605750


Organisation: Tilde SIA




This page is maintained by: Susan Fraser (email removed)