LANGUAGE ENGINEERING - PREPARATORY ACTION FOR LINGUISTIC RESOURCES ORGANIZATION FOR LANGUAGE ENGINEERING (LE-PAROLE)

Objective

LE-PAROLE is concerned with a large-scale harmonised set of text databases (corpora) and lexica for all EU languages. These resources will have a wide range of applications, including design and testing in information technology, the production of language learning material and academic research. Each 20,000-entry lexicon will be based on a software tool extended to support both the conversion and management processes of the resulting resources. The project will produce large monolingual harmonised corpora which obey common markup conventions and are compatible with the lexicons.
Progress
During the first 9 months of activities the project has successfully started the creation of the corpora and the lexicons foreseen for the different European languages.
For each of the following languages a corpus of at least 20 million words and a lexicon of 20,000 lemmas will be produced: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish (lexicon only), Swedish. In addition, corpora of 20, 15 and 3 million words will be produced for Belgian French, Irish and Norwegian, respectively.
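The planned volumes above can be tallied as a rough lower bound. The short Python sketch below is only an illustration under two assumptions: "Spanish (lexicon only)" is read as meaning that no Spanish corpus is planned, and every "at least 20 million words" corpus is counted at exactly 20 million. The names used (`FULL_CORPUS_LANGUAGES`, `planned_totals`, and so on) are hypothetical and are not part of any PAROLE tool.

```python
# Planned minimum volumes, taken from the figures stated in the text.
# Assumption: "Spanish (lexicon only)" means no Spanish corpus is planned.
FULL_CORPUS_LANGUAGES = [
    "Catalan", "Danish", "Dutch", "English", "Finnish", "French",
    "German", "Greek", "Italian", "Portuguese", "Swedish",
]
LEXICON_ONLY_LANGUAGES = ["Spanish"]

# Additional corpora, in millions of running words.
EXTRA_CORPORA_MILLIONS = {"Belgian French": 20, "Irish": 15, "Norwegian": 3}

MIN_CORPUS_MILLIONS = 20       # "at least 20 million words" per corpus
LEXICON_ENTRIES = 20_000       # lemmas per lexicon


def planned_totals():
    """Return (minimum corpus size in millions of words, total lexicon entries)."""
    corpus_millions = (
        MIN_CORPUS_MILLIONS * len(FULL_CORPUS_LANGUAGES)
        + sum(EXTRA_CORPORA_MILLIONS.values())
    )
    lexicon_entries = LEXICON_ENTRIES * (
        len(FULL_CORPUS_LANGUAGES) + len(LEXICON_ONLY_LANGUAGES)
    )
    return corpus_millions, lexicon_entries


if __name__ == "__main__":
    corpus_millions, lexicon_entries = planned_totals()
    print(f"Corpora: at least {corpus_millions} million words")
    print(f"Lexicons: {lexicon_entries} entries")
```

Under these assumptions the plan amounts to at least 258 million running words across 14 corpora and 240,000 lexical entries across 12 lexicons.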
Corpora
Permission for use has been obtained from the copyright holders (publishers, newspapers, etc.). The conversion of texts from the source format to the TEI/EAGLES CES-based PAROLE format is proceeding according to schedule. On average, 5-6 million running words are already available for the various languages. Semi-automatic tagging of part of the corpora is also under way.
All the information explicitly represented in the source texts is encoded essentially following the CES (Corpus Encoding Standard) designed by EAGLES on the basis of the TEI guidelines. 250,000 running words will be tagged at the morpho-syntactic level, following the EAGLES guidelines as instantiated by each PAROLE partner for its own language.
Each partner uses a software package of its choice to construct, mark up and tag the corpus. The compatibility and interchangeability of the various corpora are ensured by the adoption of commonly defined criteria for composition, encoding and linguistic annotation.
Lexicons
The 20,000 lexical entries which will form the initial nucleus of the lexicons to be developed in the different countries have been selected. The morphological encoding is almost complete for all the languages. The encoding at the syntactic level has started to produce the initial SGML files ready to be loaded (imported) through a common filler into the common PAROLE lexical DB.
The PAROLE lexicon model is based on the results of LRE EAGLES and EUREKA GENELEX. Thanks to this, all the lexical resources being developed are declarative, theory- and application-independent and multifunctional, and will be able to evolve easily, for example to incorporate other levels of information or to become multilingual. This approach, which meets the requirements of genericity, explicitness and variable granularity, will guarantee large-scale reusability. The model, which offers a high level of precision in the description, is designed to ensure that application-dependent data models and application dictionaries can be derived from this repository of information by mapping the application model from the generic one. The coverage is 20,000 entries per language, described at the morphological and syntactic levels and, in a few cases, at the semantic level.
The availability of rather large, uniformly structured lexical resources in all the languages mentioned above will offer the users the benefits of a standardised base.
The exchange format for the lexicons, as for the corpora, is SGML: all the lexicons share the same DTD for the morphological and syntactic layers. Moreover, the use of a common set of lexicon management tools is a guarantee that all lexicons will fully conform to the model. The use of these tools is a precondition of an industrial level of quality for the volumes of data (in so many languages) that PAROLE is to deliver.
The Way Ahead
The work to create lexicons and corpora is now continuing at full operating pace. During 1997 the Consortium will continue the production of language resources (LR). The first drafts of guidelines for encoding corpora and lexicons (user manuals) will be prepared. These guidelines, together with the availability of data encoded according to EAGLES/TEI standards, will contribute concretely to the dissemination of these standards. The validation phase will also begin, in co-operation with ELRA.
Availability
All the lexicons will be publicly available, under conditions to be determined within the project. Each corpus will be accessible via the Internet. A subset of 3 million words of each corpus (including the tagged words) will also be 'distributable', i.e. a physical copy of it can be given to users. Co-operation with ELRA will be sought to this end. Restrictions on the type of usage will depend on the restrictions imposed by the copyright holders of the source texts when they authorised the inclusion of their texts in the corpus.

Fields of science (EuroSciVoc)

CORDIS classifies projects with EuroSciVoc, a multilingual taxonomy of fields of science, through a semi-automatic process based on NLP techniques. See: The European Science Vocabulary.

Programme(s)

Multi-annual funding programmes that define the EU’s priorities for research and innovation.

FP4-TELEMATICS 2C - Specific programme of research and technological development and demonstration in the area of telematic applications of common interest, 1994-1998

Topic(s)

Calls for proposals are divided into topics. A topic defines a specific subject or area for which applicants can submit proposals. The description of a topic comprises its specific scope and the expected impact of the funded project.

D.12 - Language Engineering

Call for proposal

Procedure for inviting applicants to submit project proposals, with the aim of receiving EU funding.

No data

Funding Scheme

Funding scheme (or “Type of Action”) inside a programme with common features. It specifies: the scope of what is funded; the reimbursement rate; specific evaluation criteria to qualify for funding; and the use of simplified forms of costs like lump sums.

CSC - Cost-sharing contracts

Coordinator

Università degli Studi di Pisa

EU contribution

No data

Address

Via della Faggiola 32
56100 Pisa
Italy

Total cost

No data

Participants (14)

CENTRO DE LINGUISTICA DA UNIVERSIDADE DE LISBOA

Portugal

EU contribution

No data

Address

AVENIDA 5 OUTUBRO
1050 LISBOA

Total cost

No data

DET DANSKE SPROG- OG LITTERATURSELSKAB

Denmark

EU contribution

No data

Address

18A, FREDERIKSHOLMS KANAL
1220 COPENHAGEN K

Total cost

No data

FUNDACION BOSCH GIMPERA UNIVERSITAT DE BARCELONA

Spain

EU contribution

No data

Address

BARCELONA

Total cost

No data

GOETEBORGS UNIVERSITET

Sweden

EU contribution

No data

Address

6,RENSTROMSGATAN
GOTHENBURG

Total cost

No data

GSI-ERLI

France

EU contribution

No data

Address

1,PLACE DES MARSEILLAIS
94227 CHARENTON

Total cost

No data

INSTITIUID TEANGEOLAIOCHTA EIREANN

Ireland

EU contribution

No data

Address

FITZWILLIAM PLACE
2 DUBLIN

Total cost

No data

INSTITUT D'ESTUDIS CATALANS

Spain

EU contribution

No data

Address

47,CARRER DEL CARME
08001 BARCELONA

Total cost

No data

INSTITUT NATIONAL DE LA LANGUE FRANCAISE

France

EU contribution

No data

Address

AVENUE DE LA GRILLE D'HONNEUR, LE PARC
92211 SAINT-CLOUD CEDEX

Total cost

No data

INSTITUUT VOOR NEDERLANDSE LEXICOLOGIE

Netherlands

EU contribution

No data

Address

2-3,MATTHIAS DE VRIESHOF
2311 BZ LEIDEN

Total cost

No data

Institut für Deutsche Sprache

Germany

EU contribution

No data

Address

68016 Mannheim

Total cost

No data

Institute for Language and Speech Processing (ILSP)

Greece

EU contribution

No data

Address

22,Margari Street
11525 Athens

Total cost

No data

UNIVERSITY OF BIRMINGHAM

United Kingdom

EU contribution

No data

UNIVERSITY OF HELSINKI

Finland

EU contribution

No data

Address

8,KESKUSKATU
00014 HELSINKI

Total cost

No data

UNIVERSITY OF LIEGE

Belgium

EU contribution

No data

Address

7,PLACE DU XX AOUT
4000 LIEGE

Total cost

No data
