MULTEXT: multilingual text tools and corpora

The project has developed a set of generally usable software tools for manipulating and analysing text corpora, together with lexicons and multilingual corpora in seven European languages. It has established conventions for the encoding of corpora and harmonized specifications for computational lexicons, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT has developed the first set of publicly available large-scale resources and tools for use in corpus-based language engineering applications.

The project's specific achievements fall into three areas: lexical specifications for seven European languages (English, French, Spanish, Italian, German, Dutch, Swedish), constituting the first large-scale application of, and contribution to, the EAGLES work in this area; specifications for encoding corpora in Standard Generalized Markup Language (SGML), constituting one of the first large-scale applications of the Text Encoding Initiative Guidelines; and the specification of a data architecture for linguistic corpora, providing the first hypertext view of such corpora.

The tools include: a language-independent, parameterizable text tokenizer; a modular and language-independent part-of-speech tagger; a text aligner; a complete speech workbench; a public SGML query-language interpreter; and a set of SGML-aware corpus exploitation tools.

Text-oriented methods and software tools have come to be of primary interest to the natural language processing (NLP) community. The availability of basic multilingual tools and data will improve and extend research and development across a wide range of disciplines, including not only the various areas of language engineering but also fields such as speech technology, language learning, lexicography and lexicology, and information retrieval.

The project's methodologies and results are being used in a related project, which extends their application to 13 western and eastern European languages.

