PANACEA: automating the production of Language Resources for a globalised, multilingual economy

A web service-based production line that automates the stages in acquisition, production, updating and maintenance of Language Resources required in Machine Translation systems and other Language Technologies

A strategic challenge for Europe in today's globalised economy is to overcome language barriers through technological means. In particular, Machine Translation (MT) systems are expected to have a significant impact on the management of multilingualism in Europe, making it possible to translate the huge quantity of textual data produced and thus covering the needs of hundreds of millions of citizens. PANACEA addressed a critical threat to this vision: the so-called language-resource bottleneck. Although MT technologies may consist of language-independent engines, they depend heavily on the availability of language-dependent data for their real-life implementation, i.e. they require Language Resources (LRs). In order to equip MT systems for every pair of European languages, for every domain, and for every text genre, appropriate LRs covering every language, domain and genre must be produced. Moreover, a Language Resource for a given language can never be considered complete or final: languages change, and new knowledge domains emerge at a rapid pace. A company wishing to cover the enlarged Union market needs to produce and maintain 500 bilingual glossaries, for instance. Traditionally, LR production has been done by hand, and its high cost (highly skilled human work and development time) has hindered full coverage. Automatic production lowers the cost and time required for producing basic LRs for languages which are currently not well covered. Such reductions are the only way to guarantee the continuous supply of LRs that MT and other Language Technologies may demand in a multilingual Europe.
PANACEA has helped to demonstrate that the “LR bottleneck” problem can be effectively addressed by automation, through the production of (a) the PANACEA Platform, (b) the web services integrated within the platform, (c) the associated workflows that manage the sequencing of web services, (d) the tools for LR acquisition developed during the project, and (e) the Language Resources (LRs) produced during the project, mainly but not only for Machine Translation tools, by exploiting the platform, the web services and the specialized workflows. The PANACEA factory has been thoroughly evaluated within R&D and industrial settings. The platform and the LR production lines based on advanced technological components have proved the feasibility of the concept. PANACEA’s contribution and potential impact have been demonstrated in an industrial evaluation carried out on the adaptation of Machine Translation systems to specific and specialized domains. In terms of effort, producing a domain-adapted bilingual glossary of 1000 entries with PANACEA reduces costs from 30 person-hours to 0.5 person-hours. In terms of quality, there were no significant negative effects on the translation quality of the systems using automatically produced resources. A human evaluation showed that PANACEA domain-tuned statistical MT (SMT) gained up to 6% in quality with respect to the untuned baseline, and that the quality of SMT with automatically acquired LRs was not significantly worse than that achieved for language pairs like Italian-German by other state-of-the-art systems such as Google Translator. The PANACEA factory is now a distributed and interoperable platform of web services that can be chained to perform complex operations in the form of workflows. A number of tools, including web crawling, data cleaning, anonymization, alignment, PoS tagging and dependency parsing, are offered as web services.
These web services, which are offered for unrestricted use, have been documented and annotated with metadata and tags for easy discovery and operation in the PANACEA Registry. Most web services also offer a web-based interface to facilitate testing. By selecting the appropriate web services, the user is able to chain different production lines that automate the stages involved in the acquisition, production, updating and maintenance of the LRs required by MT and other Language Technologies. Interoperability by means of the PANACEA Common Interfaces makes it possible to choose among different web services without modifying workflow parameters. Together with the easy-to-use workflow editor TAVERNA, this feature is an asset for the proliferation and sharing of different processing chains. PANACEA myExperiment includes the workflows that combine the different web services for the production of language resources such as parallel corpora, bilingual glossaries, rich information lexica, etc. These workflows are accessible to users and can be shared among them. To guide and facilitate the user experience, a total of 10 videos offering several introductory scenarios are made available, as well as documentation and tutorials. Publications and generated resources are additionally available open-access via the UPF Digital Repository as part of the European Union OpenAIRE initiative.

The future. Interested in collaboration?

The successful PANACEA results will be sustained by the PANACEA partners, who intend to exploit them with a business model based mainly on the quick and cheap production of new resources on demand. Platform exploitation by third parties is also possible for academic or industrial research purposes at no cost, in an attempt to gain visibility and credibility.
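The idea that services sharing a common interface can be chained and swapped without touching the workflow itself can be pictured with a small sketch. This is only an illustration under stated assumptions: the record shape, the function names (`clean`, `tokenize_simple`, `tokenize_punct`, `run_workflow`) and the toy processing steps are hypothetical stand-ins written for this example. They are not the PANACEA Common Interfaces, the PANACEA web services or the TAVERNA API, and no network calls are made.

```python
import re
from typing import Callable, Dict, List

# Hypothetical stand-in for a common interface: every service accepts and
# returns the same record shape, so services can be swapped freely.
Record = Dict[str, object]
Service = Callable[[Record], Record]


def clean(record: Record) -> Record:
    """Toy 'data cleaning' step: collapse runs of whitespace."""
    text = " ".join(str(record["text"]).split())
    return {**record, "text": text}


def tokenize_simple(record: Record) -> Record:
    """Toy tokenizer: split on single spaces only."""
    return {**record, "tokens": str(record["text"]).split(" ")}


def tokenize_punct(record: Record) -> Record:
    """Alternative toy tokenizer: separate punctuation from words."""
    tokens = re.findall(r"\w+|[^\w\s]", str(record["text"]))
    return {**record, "tokens": tokens}


def run_workflow(services: List[Service], record: Record) -> Record:
    """Apply each service in order, feeding its output to the next one."""
    for service in services:
        record = service(record)
    return record


raw = {"text": "Language   resources,  on demand."}

# Two workflows that differ only in which tokenizer is plugged in;
# run_workflow itself needs no changes.
result_a = run_workflow([clean, tokenize_simple], raw)
result_b = run_workflow([clean, tokenize_punct], raw)

print(result_a["tokens"])
print(result_b["tokens"])
```

Because both tokenizers accept and return the same record shape, replacing one with the other changes only the list passed to `run_workflow`, which is the property the common-interface design is meant to provide.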
Contact is welcome both from researchers in the area of language resource development and from application developers interested in language resources whose production they want to automate.

Project: PANACEA - Platform for Automatic, Normalized Annotation and Cost-Effective Acquisition of Language Resources for Human Language Technologies
Project reference: 248064
Start date: 1 January 2010. End date: 31 December 2012
Coordinator: Núria Bel. Universitat Pompeu Fabra, Institut Universitari de Linguística Aplicada. C/Roc Boronat 138 2 floor. 08018 Barcelona, Spain. Phone: +34 93 542 1193; Email: info@panacea-lr.eu; Web: http://www.panacea-lr.eu/
Participants: Universitat Pompeu Fabra (Spain), Consiglio Nazionale delle Ricerche (Italy), Athena Research and Innovation Centre (Greece), University of Cambridge (UK), Linguatech (Germany), Dublin City University (Ireland), Evaluations and Language Resources Distribution Agency (France)

Keywords

Countries

Germany, Greece, Spain, France, Ireland, United Kingdom
