The aim of the project is to provide a computational methodology and, in more practical terms, a toolbox which will aid the human translator working in a particular subset of general language (a sublanguage) in the following two ways:
to relieve the translator of the repetitive part of the work, which mostly involves specialised types of text;
to enhance productivity and translation quality by proposing alternative solutions and providing sophisticated ancillary tools.
A prototype application demonstrating the validity of the approach and allowing it to be evaluated in terms of translator productivity will be produced as a result of the project. The project will initially consider four languages: English, French, Greek and Portuguese.
TRANSLEARN is based upon sophisticated pattern-matching techniques, involving both linguistic and statistical processing, which are used to identify the longest coherent part of the source text that has already been translated and stored in a text database in both source and translated form. In the case of a full match between a piece of source text and a database entry, the corresponding translated text can be output automatically. Statistically ranked alternative translations can also be provided, if they exist. If no full match is detected, the partial matches are reconstructed and optimally evaluated, and the result is presented to the translator together with a confidence measure. Fragments of source text for which no translation above a certain confidence threshold exists will be presented to the translator to translate. The translation is then incorporated into the database for future use. Existing field-proven techniques and utilities will be used for the creation of the database of parallel texts.
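The lookup cycle described above (full match, ranked alternatives, partial match with a confidence measure, fallback to the translator) can be illustrated with a minimal sketch. It is not the TRANSLEARN matcher: the class name TranslationMemory, the method names and the 0.75 threshold are hypothetical, and token-level similarity from Python's standard difflib module stands in for the project's linguistic and statistical processing.

```python
from collections import Counter
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy text database of source segments and their stored translations."""

    def __init__(self):
        # source segment -> Counter of translations seen for that segment
        self._entries = {}

    def add(self, source, target):
        """Store a translation so that it can be reused in later lookups."""
        self._entries.setdefault(source, Counter())[target] += 1

    def _ranked(self, source):
        # Alternatives ranked by how often each one was stored.
        return [t for t, _ in self._entries[source].most_common()]

    def lookup(self, source, threshold=0.75):
        """Return (kind, confidence, translations) for a source segment."""
        if source in self._entries:
            return ("full", 1.0, self._ranked(source))

        # No full match: score every stored segment by token overlap
        # (assumption: a simple similarity ratio stands in for the
        # confidence measure described in the text).
        best_score, best_source = 0.0, None
        for stored in self._entries:
            score = SequenceMatcher(None, source.split(), stored.split()).ratio()
            if score > best_score:
                best_score, best_source = score, stored

        if best_source is not None and best_score >= threshold:
            return ("partial", best_score, self._ranked(best_source))

        # Below the threshold: the segment goes to the human translator.
        return ("none", best_score, [])


if __name__ == "__main__":
    tm = TranslationMemory()
    tm.add("This Regulation shall enter into force",
           "Le présent règlement entre en vigueur")
    print(tm.lookup("This Regulation shall enter into force today"))
```

Once the translator has supplied a translation for a segment reported as "none", calling add with that pair makes it available to later lookups, which mirrors the incorporation step described above.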
TRANSLEARN will collect and investigate a large body of translated texts within a well-defined sublanguage and text type, including the EC CELEX database, select the most coherent and homogeneous set of standard texts, and store these in an appropriately designed text database using existing text handling and alignment software tools. A linguistically and statistically based pattern-matching mechanism, to be triggered by a source text, will then be developed. The most frequently used fixed locutions and syntactic structures in the sublanguage considered will be stored in a separate database, as will statistical data concerning the text database.
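As a rough illustration of how frequently recurring fixed expressions might be collected from the stored texts, the sketch below counts word n-grams and keeps those that recur. The function name frequent_ngrams and its parameters are hypothetical, and plain frequency counting is an assumption made here for illustration; the text does not specify the method used.

```python
from collections import Counter


def frequent_ngrams(sentences, n=3, min_count=2):
    """Return (n-gram, count) pairs that occur at least min_count times."""
    counts = Counter()
    for sentence in sentences:
        # Naive tokenisation: lower-case and split on whitespace.
        tokens = sentence.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    # Most frequent first; ties keep the order of first occurrence.
    return [(" ".join(gram), count)
            for gram, count in counts.most_common()
            if count >= min_count]


if __name__ == "__main__":
    corpus = [
        "This Regulation shall enter into force",
        "This Decision shall enter into force",
        "It shall enter into force",
    ]
    for gram, count in frequent_ngrams(corpus):
        print(count, gram)
```

The resulting list could populate a separate table of candidate expressions; a real system would also need linguistic filtering, which this sketch leaves out.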
Maximum use will be made of existing products and software techniques, and the sublanguages used for the prototype will be drawn from administrative texts (EC regulations, etc.) and technical texts (software documentation). The prototype will be limited to fairly simple morphological and syntactic processing, and to known statistical techniques for clustering and taxonomy derivation for fixed locutions.
TRANSLEARN attempts to combine the statistical and linguistic/AI approaches (which are often regarded as mutually incompatible) in a synergistic way, and to produce a large database of appropriately organized, indexed parallel texts in two sublanguages in an easily accessible form. The prototype software package produced will be a powerful tool for pattern-matching and other intelligent applications. Tools of this kind are expected to turn into highly marketable products, and TRANSLEARN will be marketed both as a stand-alone utility and as an integral part of a toolbox with wider scope. It is intended to extend the prototype to cover the remaining EC official languages, and to obtain feedback on its functionality from translation services dealing with the types of text covered by the project. The prototype may also be ported to the DOS and Macintosh platforms.