MULTILINGUAL TEXT TOOLS AND CORPORA FOR CENTRAL AND EASTERN EUROPEAN LANGUAGES

Exploitable results

The MULTEXT-East language resources are a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and comprises EAGLES-based morphosyntactic specifications, which define the features used in word-level syntactic annotation; medium-scale morphosyntactic lexica; and annotated parallel, comparable, and speech corpora. The most important component is the linguistically annotated corpus consisting of Orwell's novel "1984" in the English original and its translations. The resources are the result of several EU projects: MULTEXT-East (produced linked resources for Romanian, Slovene, Czech, Bulgarian, Estonian, Hungarian, and English), TELRI (added resources for Lithuanian, Croatian, Serbian, and Russian; first release), and CONCEDE (validation, re-encoding; partial re-release). The latest version of the resources is Version 3 (April 2004), which brings together the two earlier releases, makes them available in TEI P4 XML, and introduces further extensions, such as the specification for Resian, a dialect of Slovene. This dataset, unique in its language coverage and richness of encoding, is extensively documented and freely available for research purposes. For commercial exploitation, a licence has to be negotiated separately for each language with the relevant project partner. For further information, consult the MULTEXT-East web site: http://nl.ijs.si/ME/
