Multilingual Text Tools and Corpora

Project information

MULTEXT

Grant agreement ID: LRE62050

Project closed

Start date 1 January 1994

End date 1 March 1996

Funded under

Specific programme of research and technological development (EEC) in the field of telematic systems in areas of general interest - Linguistic research and engineering -, 1990-1994

Total cost

No data

EU contribution

No data

Coordinated by

Universite de Provence

Objective

The project seeks to contribute to the development of generally usable software tools to manipulate and analyse text corpora and to create multi-lingual text corpora with structural and linguistic markup. It will attempt to establish conventions for the encoding of such corpora, as well as guidelines for text software development, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives.

The consortium is committed to making its results, namely corpora, related tools, specifications and accompanying documentation, freely and publicly available.

Six major European companies are involved in the project as industrial partners. They will both contribute to the specification and development of the basic tools and provide a first indication of the exploitability of these tools by using them as a basis for building several high-level NLP applications.

Approach and Methodology

At the outset of the project, the consortium will undertake to analyse, test and extend the SGML-based recommendations of the Text Encoding Initiative (TEI) on real-size data, and gradually develop encoding conventions specifically suited to multi-lingual corpora and the needs of NLP and MT corpus-based research. To manipulate large quantities of such texts, the partners will develop conventions for tool construction and use them to build a range of highly language-independent, atomic and extensible software tools.

These specifications will be the basis for the development of two major software resources, namely (a) tools for the linguistic annotation of texts (e.g. segmenters, morphological analysers, part of speech disambiguators, aligners, prosody taggers and post-editing tools), and (b) tools for the exploitation of annotated texts (e.g. tools for indexing, search and retrieval, statistics). This software will be implemented under UNIX, while its specific properties should facilitate portability to other systems. Moreover, it will be integrated by means of a common user interface into a text corpus manipulation system expected to provide the basic functionality needed in academic or industrial corpus research. For the overall software design as well as the development of specific components, MULTEXT will capitalise on the preliminary results achieved in the ALEP project.

By using the emerging software tools, the consortium plans to produce a substantial multilingual corpus, including parallel texts and spoken data, in six EC languages (English, French, Spanish, German, Italian and Dutch). The entire corpus will be marked for gross logical and structural features; a subset of the corpus will be marked and hand-validated for sentence and sub-sentence features, part of speech, alignment of parallel texts, and speech prosody. All markup will have to comply with the TEI-based corpus encoding conventions established within the project. The corpus will also serve as a testbed for the project tools and a resource for future tool development and evaluation.

An application programming interface will facilitate the coupling of the progressively refined software and data components with several existing language application systems or prototypes. In particular, the industrial partners plan to develop extraction software for lexical and terminological information to complement and improve their Terminology Management, Information Retrieval or Machine Translation systems. Some effort will also be devoted to a prototypical application for testing and comparing successive versions of a Machine Translation system.

Exploitation and Future Prospects

Text-oriented methods and software tools have come to be of primary interest to the NLP community. It is therefore expected that the availability of basic multi-lingual tools and data will improve and extend R&D across a wide range of disciplines, including not only the various areas of NLP (language understanding and generation, translation, etc.), but also fields such as speech technology, language learning, lexicography and lexicology, literary and linguistic computing, information retrieval, etc. By feeding the results into several commercial application systems/prototypes, the project is expected to show the potential of state-of-the-art methods in corpus linguistics for improving industrially relevant language systems and services.

By interacting with prominent research organizations and initiatives inside and outside the EC, it is hoped that MULTEXT's approach will receive the attention of a wide international forum. In the longer term, it can be anticipated that this project will strengthen the methodological and technological foundations for the uniform representation, annotation and exploitation of textual information.

Fields of science (EuroSciVoc)

Projects in CORDIS are classified using EuroSciVoc, a multilingual taxonomy covering all fields of science, through a semi-automatic process based on natural language processing techniques. More information: European Science Vocabulary.

Programme(s)

Multi-annual funding programmes that define the European Union's priorities for research and innovation.

FP3-LRE - Specific programme of research and technological development (EEC) in the field of telematic systems in areas of general interest - Linguistic research and engineering -, 1990-1994

Topic(s)

Calls for proposals are divided into topics. Each topic defines a specific subject or area that applicants' proposals should address. The topic description covers its detailed scope and the expected impact of the funded project.

No data

Call for proposals

The procedure for inviting applicants to submit project proposals in order to obtain European Union funding.

No data

Funding scheme

A funding scheme (or "type of action") within a programme that has common features. It specifies the scope of funding, the reimbursement rate, the detailed criteria for assessing the eligibility of costs for funding, and the use of simplified forms of cost accounting such as lump sums.

No data

Coordinator

Universite de Provence

EU contribution

No data

Address

29, Avenue Robert Schuman
13621 Aix-en-Provence Cedex 1

Total cost

No data

Participants (13)

CAP Debis Systemhaus KSP GmbH

Germany

EU contribution

No data

Address

Erich-Herion-Straße 11-13
70736 Fellbach

Total cost

No data

Digital Equipment

Netherlands

EU contribution

No data

Address

Total cost

No data

EUROLANG-SITE

France

EU contribution

No data

Address

Total cost

No data

ISSCO

Switzerland

EU contribution

No data

Address

Total cost

No data

Rank Xerox Research Centre

France

EU contribution

No data

Address

6 chemin de Maupertuis
38240 Meylan

Total cost

No data

Siemens Nixdorf Informationssysteme AG

Germany

EU contribution

No data

Address

München

Total cost

No data

Siemens Nixdorf-CDS

Spain

EU contribution

No data

Address

Total cost

No data

UNIVERSITAT AUTONOMA DE BARCELONA

Spain

EU contribution

No data

Address

Total cost

No data

UNIVERSITEIT UTRECHT

Netherlands

EU contribution

No data

Address

Heidelberglaan 8

Total cost

No data

Universitat Central de Barcelona

Spain

EU contribution

No data

Address

Total cost

No data

University of Edinburgh

United Kingdom

EU contribution

No data

Address

Edinburgh

Total cost

No data

Università degli Studi di Pisa

Italy

EU contribution

No data

Address

Pisa

Total cost

No data

WESTFAELISCHE WILHELMS - UNIVERSITAET MUENSTER

Germany

EU contribution

No data

Address

Domagkstrasse 5
48129 MUENSTER

Total cost

No data
