Final Report Summary - LATEST (Advanced LAnguage TEchnology Platform for TranSlaTors (LATEST))

The objective of LATEST, which drew on input from and collaboration with translators, was to develop an original methodology for, and to implement, a novel LAnguage TEchnology platform for tranSlaTors (LATEST) that automatically identifies multiword expressions (collocations) and provides their translations, thus helping translators and interpreters to understand and translate them.

LATEST regards the task of translating multiword expressions as a two-stage process. The first stage is the extraction of multiword expressions (MWEs) in each of the languages; the second stage is a matching procedure over the MWEs extracted in each language, which proposes translation equivalents.

The methodology, which is intended to work for any pair of languages, is based on a knowledge-poor approach: it does not depend on translation resources such as dictionaries, translation memories or parallel corpora, which can be time-consuming to develop or difficult to acquire because they are expensive or proprietary. The only information comes from comparable corpora, which are inexpensive to compile. The implementation covers English and Spanish and focuses on a particular subclass of MWEs: verb-noun expressions (collocations).

In the MWE extraction phase, statistical association measures were used to quantify the strength of the relationship between two words: a verb-noun combination scoring above a specific threshold was proposed as a candidate MWE. In the MWE translation phase, distributional similarity methods were applied, on the premise that MWEs have the same or very similar contexts as their translation equivalents.
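The two phases can be illustrated with a minimal sketch. It is not the project's implementation: the report does not name the association measures or similarity methods that were used, so pointwise mutual information (PMI) and cosine similarity stand in here as common examples of each. The function names, the `min_count` filter, the toy data and the shared context features (`ctx_1`, `ctx_2`, `ctx_3`) are all hypothetical. In particular, the sketch assumes that context vectors from both languages are already expressed in one common feature space; how LATEST makes contexts comparable across languages is not described in this summary.

```python
import math
from collections import Counter


def pmi_scores(pairs):
    """Pointwise mutual information (log base 2) for each (verb, noun) pair."""
    pair_counts = Counter(pairs)
    verb_counts = Counter(verb for verb, _ in pairs)
    noun_counts = Counter(noun for _, noun in pairs)
    total = len(pairs)
    return {
        (verb, noun): math.log2(count * total / (verb_counts[verb] * noun_counts[noun]))
        for (verb, noun), count in pair_counts.items()
    }


def extract_candidates(pairs, threshold, min_count=2):
    """Keep pairs whose PMI reaches the threshold (hypothetical selection rule)."""
    counts = Counter(pairs)
    scores = pmi_scores(pairs)
    return sorted(
        pair for pair, score in scores.items()
        if score >= threshold and counts[pair] >= min_count
    )


def cosine(u, v):
    """Cosine similarity between two sparse context vectors stored as dicts."""
    dot = sum(weight * v.get(feature, 0.0) for feature, weight in u.items())
    norm_u = math.sqrt(sum(weight * weight for weight in u.values()))
    norm_v = math.sqrt(sum(weight * weight for weight in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_translations(source_vector, target_vectors):
    """Order target-language MWEs by context similarity to the source MWE."""
    return sorted(
        target_vectors,
        key=lambda mwe: cosine(source_vector, target_vectors[mwe]),
        reverse=True,
    )


# Toy verb-noun co-occurrences (invented for illustration).
pairs = (
    [("take", "decision")] * 4 + [("take", "book")]
    + [("read", "book")] * 4 + [("read", "decision")]
)
candidates = extract_candidates(pairs, threshold=0.5)

# Toy context vectors over an assumed shared feature space.
source_vector = {"ctx_1": 2.0, "ctx_2": 1.0}
target_vectors = {
    "tomar decisión": {"ctx_1": 1.0, "ctx_2": 1.0},
    "leer libro": {"ctx_3": 3.0},
}
ranking = rank_translations(source_vector, target_vectors)
```

In this toy setting the frequent pairs pass the threshold while the rare cross-combinations receive negative PMI, and the target expression whose contexts overlap with the source expression is ranked first.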

The comparable corpora compiled for this project covered the newswire genre, chosen because newswire is widespread and available in many languages. As the approach was developed to be language-independent, the intention was to test it on other languages after the lifetime of the project.

Extensive evaluation experiments were conducted for both the extraction and the translation phase, with a detailed comparison of the various measures and with inter-annotator agreement computed across all annotators. The effect of both the quality and the size of the comparable corpora was also investigated. The evaluation results point to an interesting finding and shed light, for the first time, on the following: for the performance of automatic translation of MWEs, the quality of the comparable corpora matters more than the size of the data.
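As an illustration of how inter-annotator agreement can be computed, the following sketch implements Cohen's kappa for two annotators. This summary does not state which agreement measure the project used, so kappa is an assumption chosen because it is widely used; with more than two annotators it would be applied pairwise or replaced by a multi-rater measure such as Fleiss' kappa. The function name and the toy labels are hypothetical.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Proportion of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each annotator's label distribution.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # Both annotators used a single identical label throughout.
    return (observed - expected) / (1.0 - expected)


# Toy judgements: 1 = "candidate is an MWE", 0 = "not an MWE" (invented data).
annotator_a = [1, 1, 0, 0]
annotator_b = [1, 0, 0, 0]
kappa = cohen_kappa(annotator_a, annotator_b)
```

Here the annotators agree on three of four items (observed agreement 0.75) against a chance agreement of 0.5, which gives a kappa of 0.5.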

LATEST achieved its specific research objectives and research training objectives. The completion of the research training objectives and the acquisition of new skills were instrumental in the fellow successfully refocusing his research on a new area, (computational) phraseology, where he achieved significant novel results. This is evidenced by the publications related to the project and by invitations to give talks at international conferences. A wide range of dissemination and outreach activities contributed to publicising the project and making its outputs easy to understand.

