LANGUAGE ENGINEERING - PREPARATORY ACTION FOR LINGUISTIC RESOURCES ORGANIZATION FOR LANGUAGE ENGINEERING

Objective

LE-PAROLE is concerned with a large-scale, harmonised set of text databases (corpora) and lexicons for all EU languages. These resources will have a wide range of applications, including design and testing in information technology, the production of language learning material and academic research. Each 20,000-entry lexicon will be based on a software tool extended to support both the conversion and the management of the resulting resources. The project will produce large monolingual harmonised corpora which obey common markup conventions and are compatible with the lexicons.
Progress
During the first 9 months of activities the project has successfully started the creation of the corpora and the lexicons foreseen for the different European languages.
For each of the following languages a corpus of at least 20 million words and a lexicon of 20,000 lemmas will be produced: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish (lexicon only), Swedish. In addition, corpora of 20, 15 and 3 million words will be produced for Belgian-French, Irish and Norwegian respectively.
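The planned volumes above can be tallied with a short Python sketch. The figures are taken from the text and treated as minimum targets; the variable and function names are illustrative only and are not part of any PAROLE tool.

```python
# Planned PAROLE resources as listed in the text (minimum targets).
FULL_CORPUS_WORDS = 20_000_000   # "at least 20 million words" per corpus
LEXICON_LEMMAS = 20_000          # lemmas per lexicon

# Languages receiving both a corpus and a lexicon.
corpus_and_lexicon = [
    "Catalan", "Danish", "Dutch", "English", "Finnish", "French",
    "German", "Greek", "Italian", "Portuguese", "Swedish",
]
# Spanish is listed as "lexicon only".
lexicon_only = ["Spanish"]
# Languages receiving a corpus only, with their stated sizes.
corpus_only = {
    "Belgian-French": 20_000_000,
    "Irish": 15_000_000,
    "Norwegian": 3_000_000,
}


def planned_totals():
    """Return (minimum total corpus words, total lexicon lemmas)."""
    corpus_words = (
        len(corpus_and_lexicon) * FULL_CORPUS_WORDS
        + sum(corpus_only.values())
    )
    lemmas = (len(corpus_and_lexicon) + len(lexicon_only)) * LEXICON_LEMMAS
    return corpus_words, lemmas


if __name__ == "__main__":
    words, lemmas = planned_totals()
    print(f"corpus words (minimum): {words:,}")
    print(f"lexicon lemmas: {lemmas:,}")
```

Under these assumptions the plan amounts to at least 258 million corpus words across 14 corpora and 240,000 lemmas across 12 lexicons.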
Corpora
Permission for use has been obtained from the copyright holders (publishers, newspapers, etc.). The conversion of texts from the source format to the TEI/EAGLES CES based PAROLE format is proceeding according to schedule. On average, 5-6 million running words are already available for the various languages. Semi-automatic tagging of part of the corpora is also under way.
All the information explicitly represented in the source texts is encoded following essentially the CES (Corpus Encoding Standard) designed by EAGLES, on the basis of the TEI guidelines. 250,000 running words will be tagged at the morpho-syntactic level, following the EAGLES guidelines, instantiated by each PAROLE partner for its own language.
Each partner uses a software package of its choice to construct, mark up and tag its corpus. The compatibility and interchangeability of the various corpora is ensured by the adoption of commonly defined criteria for composition, encoding and linguistic annotation.
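As a rough illustration of what word-level morpho-syntactic annotation in a markup format makes possible, the sketch below reads a small tagged sentence and counts its tags. The element name `w`, the attributes `lemma` and `msd`, and the tag values are hypothetical stand-ins, not the actual PAROLE or CES markup; a well-formed XML fragment is used in place of SGML so that Python's standard library can parse it.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Hypothetical tagged sentence; element and attribute names are assumptions
# made for this example and do not reproduce the PAROLE encoding.
SAMPLE = """<s id="s1">
  <w lemma="the" msd="DET">The</w>
  <w lemma="corpus" msd="NOUN">corpora</w>
  <w lemma="be" msd="VERB">are</w>
  <w lemma="tag" msd="VERB">tagged</w>
</s>"""


def read_tokens(fragment):
    """Return (word form, lemma, tag) triples for every <w> element."""
    root = ET.fromstring(fragment)
    return [(w.text, w.get("lemma"), w.get("msd")) for w in root.iter("w")]


def tag_counts(fragment):
    """Count how often each morpho-syntactic tag occurs in the fragment."""
    return Counter(tag for _, _, tag in read_tokens(fragment))


if __name__ == "__main__":
    for form, lemma, tag in read_tokens(SAMPLE):
        print(form, lemma, tag)
    print(tag_counts(SAMPLE))
```

Because every partner encodes to the same conventions, a reader of this kind would in principle work unchanged on any of the corpora, whichever package produced them.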
Lexicons
The 20,000 lexical entries which will form the initial nucleus of the lexicons to be developed in the different countries have been selected. The morphological encoding is almost complete for all the languages. Encoding at the syntactic level has started, producing the initial SGML files ready to be loaded (imported) through a common filler into the common PAROLE lexical DB.
The PAROLE lexicon model is based on the results of LRE EAGLES and EUREKA GENELEX. As a result, all the lexical resources being developed are declarative, theory- and application-independent and multifunctional, and will be able to evolve easily, for example to incorporate other levels of information or to become multilingual. This approach, which meets the requirements of genericity, explicitness and variable granularity, will guarantee large-scale reusability. The model, which describes entries with a high level of precision, is designed so that application-dependent data models and application dictionaries can be derived from this repository of information by mapping the generic model onto the application model. The coverage is 20,000 entries per language described at the morphological and syntactic levels, and in a few cases at the semantic level.
The availability of rather large, uniformly structured lexical resources in all the languages mentioned above will offer the users the benefits of a standardised base.
The exchange format for the lexicons, as for the corpora, is SGML: all the lexicons share the same DTD for the morphological and syntactic layers. Moreover, the use of a common set of lexicon management tools is a guarantee that all lexicons will fully conform to the model. The use of these tools is a precondition of an industrial level of quality for the volumes of data (in so many languages) that PAROLE is to deliver.
The Way Ahead
The work to create lexicons and corpora is now proceeding at full capacity. During 1997 the Consortium will continue the production of linguistic resources (LR). The first drafts of guidelines for encoding corpora and lexicons (user manuals) will be prepared. These guidelines, together with the availability of data encoded according to EAGLES/TEI standards, will contribute concretely to the dissemination of these standards. The validation phase will also begin, in co-operation with ELRA.
Availability
All the lexicons will be publicly available, under conditions to be determined within the project. Each corpus will be accessible via the Internet. A subset of 3 million words of each corpus (including the tagged words) will also be 'distributable', i.e. a physical copy of it can be given to users. Co-operation with ELRA will be sought to this end. Restrictions on the type of usage will depend on the restrictions imposed by the copyright holders of the source texts when they authorised the inclusion of their texts in the corpus.

Funding Scheme

CSC - Cost-sharing contracts

Coordinator

Università degli Studi di Pisa
Address
Via Della Faggiola 32
56100 Pisa
Italy

Participants (14)

CENTRO DE LINGUISTICA DA UNIVERSIDADE DE LISBOA
Portugal
Address
Avenida 5 Outubro
1050 Lisboa
DET DANSKE SPROG- OG LITTERATURSELSKAB
Denmark
Address
18A,Frederiksholms Kanal
1220 Copenhagen K
FUNDACION BOSCH GIMPERA UNIVERSITAT DE BARCELONA
Spain
Address

Barcelona
GOETEBORGS UNIVERSITET
Sweden
Address
6,Renstromsgatan
Gothenburg
GSI-ERLI
France
Address
1,Place Des Marseillais
94227 Charenton
INSTITIUID TEANGEOLAIOCHTA EIREANN
Ireland
Address
Fitzwilliam Place
Dublin 2
INSTITUT D'ESTUDIS CATALANS
Spain
Address
47,Carrer Del Carme
08001 Barcelona
INSTITUT NATIONAL DE LA LANGUE FRANCAISE
France
Address
Avenue De La Grille D'Honneur, Le Parc
92211 Saint-Cloud Cedex
INSTITUUT VOOR NEDERLANDSE LEXICOLOGIE
Netherlands
Address
2-3,Matthias De Vrieshof
2311 BZ Leiden
Institut für Deutsche Sprache
Germany
Address

68016 Mannheim
Institute for Language and Speech Processing (ILSP)
Greece
Address
22,Margari Street
11525 Athens
UNIVERSITY OF BIRMINGHAM
United Kingdom
Address
Edgbaston
B15 2TT Birmingham
UNIVERSITY OF HELSINKI
Finland
Address
8,Keskuskatu
00014 Helsinki
UNIVERSITY OF LIEGE
Belgium
Address
7,Place Du XX Aout
4000 Liege