ACCURAT - Analysis and Evaluation of Comparable Corpora for Under-Resourced Areas of Machine Translation
248347 - STREP

At a glance
ICT-2007.2.2 - Cognitive Systems, Interaction, Robotics
- Duration: 30 months
- Start date: 1 January 2010
- End date: 30 June 2012
- Project officer: Aleksandra Wesolowska
- Website
- Annual Report 2011
- Annual Report 2010
- Presentation
- Flyer
Almost half of the citizens of the European Union do not have a good command of any language other than their native one. There is an urgent need for advanced language technologies and tools to facilitate international and multilingual interaction.
Challenge
In recent decades data-driven approaches have significantly advanced the development of machine translation. However, the applicability of current data-driven methods directly depends on the availability of very large quantities of parallel corpus data. For this reason the translation quality of current data-driven MT systems varies dramatically from being quite good for language pairs with large corpora available (e.g. English and French) to being almost unusable for under-resourced languages and domains (e.g. Latvian and Croatian).
Goal
The ACCURAT project addresses the widely recognized bottleneck of insufficient parallel corpora for data-driven MT systems. The goal of the project is to research methods and techniques for exploiting comparable corpora to overcome the lack of linguistic resources for under-resourced areas of machine translation. The ultimate aim of the project is to achieve a significant increase in MT quality for under-resourced languages and narrow domains.
Scientific Innovation
The project will develop methods and tools to measure, find and use comparable corpora to improve the quality of statistical and rule-based MT systems for under-resourced languages and domains.
The result
The ACCURAT project will provide researchers and developers with a fully functional model, methodology and tools for exploiting comparable corpora in MT:
- criteria of corpora comparability and methods of measuring it (criteria and metrics of comparability and parallelism)
- methods for alignment and extraction of lexical, terminological and other linguistic data from comparable corpora (toolkit for multi-level alignment and information extraction from comparable corpora)
- tools for building comparable corpora from the Web
- collection of comparable corpora for under-resourced languages and narrow domains for the project languages
- methodology for application of data extracted from comparable corpora in statistical and rule-based machine translation.
The ACCURAT methodology and tools will be demonstrated by application scenarios developed in the project: translation solutions for professional translators in localization services, MT for web authoring – blog writing, and MT in narrow domains.
Impact
The ACCURAT project will provide methods for the automatic acquisition and annotation of language resources, closing gaps in language coverage, increasing translation quality, and making automated translation more adaptive. ACCURAT will have a positive impact on the further European integration of research and the ICT industry in countries which have recently acceded to the EU (Latvia, Lithuania, Estonia, Romania, Slovenia) and in the candidate country Croatia.
The technological advances brought about by the ACCURAT project will advance the overall theory and practice of MT, corpus linguistics, information extraction and natural language processing as a whole.
Timeframe
The project will start on 1st January 2010 and run until 30th June 2012.
The consortium consists of: Tilde SIA, University of Sheffield, University of Leeds, Athena Research and Innovation Center in Information Communication & Knowledge Technologies, University of Zagreb, DFKI, Institutul de Cercetari Pentru Inteligentia Artificiala, Linguatec GmbH and Zemanta d.o.o.
Where will the project be present?
ACCURAT is co-organising the 5th BUCC workshop at LREC2012
The joint proposal for the 5th BUCC workshop, which will be held as an LREC2012 workshop in Istanbul on 2012-05-26, has been accepted. ACCURAT has been selected as an organiser together with a number of important EU-funded projects dealing with comparable and parallel corpora. The special topic of the workshop is . More information, including the Call for Papers, can be found on the workshop web site.
Co-ordinator
Contact person: Aivars Berzins
This page is maintained by: Susan Fraser
