
Automatic building of Machine Translation

Final Report Summary - ABU-MATRAN (Automatic building of Machine Translation)

Abu-MaTran is a four-year project (2013–2016) that seeks to enhance industry-academia cooperation as a key means of tackling one of Europe’s biggest challenges: multilingualism. We aim to increase the hitherto low industrial adoption of machine translation (MT) by identifying crucial cutting-edge research techniques (automatic acquisition of corpora and linguistic resources, pivot techniques, linguistically augmented statistical translation and diagnostic evaluation), making them suitable for commercial exploitation and finally transferring this knowledge to industry. In the other direction, we transfer industry know-how on management, processes, etc. back to academia to make research products more robust. The project exploits the open-source business model: all the resources produced will be released as free/open-source software, resulting in effective knowledge transfer beyond the consortium.

We work on a case study of strategic interest for Europe: we provide MT for the language of a new member state (Croatian) and then extend it to related languages of candidate member states.


The project has had four milestones:
- The first (July 2013) consisted of an on-line MT system for English–Croatian based on publicly available resources that was released on July 1st 2013, to mark Croatia's accession to the EU.
- The second (December 2014) consisted of an improved version of the on-line English–Croatian MT system released in Milestone 1. The system for Milestone 1 was generic (targeted at the general domain). For Milestone 2 we improved this generic system and also released an additional MT system for a specific domain, tourism, given the importance of this sector in Croatia's economy. The tourism system was built using parallel data acquired with the web crawlers developed within the project and outperforms other on-line MT systems for this language pair and domain.
- The third (extension of MT to other South Slavic languages) built upon our previous work on Croatian and was met with the release of MT systems for a set of related languages: Serbian, Bosnian and Slovenian.
- Finally, the fourth (final MT systems) was met with the release of (i) a substantially improved MT system for English–Croatian that follows the neural MT approach and uses data selection techniques and (ii) a rule-based MT system for Croatian–Serbian developed collaboratively during project secondments.

During the final period, the techniques developed in the project for the use case of South Slavic languages have been applied successfully to other languages as part of shared tasks, thus demonstrating their high degree of language independence. Namely, we built MT systems for English–Finnish as part of WMT2015 and WMT2016, and for Spanish–Catalan and Spanish–Basque as part of TweetMT2015.


The main research activities that we have carried out during the project to achieve these milestones are as follows:

- Web crawling. We have established a novel pipeline to crawl massive amounts of monolingual and parallel data from top-level domains. The pipeline is ready for commercial exploitation.

- Acquisition of linguistic resources (bilingual dictionaries and transfer rules). We have developed methodologies (i) to enable non-expert users to improve the coverage of morphological dictionaries and (ii) to learn translation rules automatically from very small parallel corpora.

- A novel procedure to clean publicly available corpora that are not usable for machine translation in their original form.

- Implementation of a novel cloud-based language model that allows vast amounts of monolingual data to be used in MT.

- Linguistically-augmented and morph segmentation approaches to statistical and neural machine translation.

- Improved selection of training data for machine translation using linguistic information and quality estimation techniques.

- Development of an MT system between two closely related languages (Croatian–Serbian) through a collaborative process.

- Participation in a number of shared tasks on several topics, including document alignment, cleaning of parallel data and quality estimation. Our submissions obtained top rankings in MT, quality estimation and parallel data cleaning.

All the tools and data sets developed have been released under open-source licenses and can be found on the project's website.


Additional information about the project can be found on its website: http://www.abumatran.eu/