LE-PAROLE is concerned with a large-scale harmonised set of text databases (corpora) and lexica for all EU languages. These resources will have a wide range of applications, including design and testing in information technology, the production of language learning material and academic research. Each 20,000-entry lexicon will be based on a software tool extended to support both the conversion and management processes of the resulting resources. The project will produce large monolingual harmonised corpora which obey common markup conventions and are compatible with the lexicons.
During the first 9 months of activities the project has successfully started the creation of the corpora and the lexicons foreseen for the different European languages.
For each of the following languages, a corpus of at least 20 million words and a lexicon of 20,000 lemmas will be produced: Catalan, Danish, Dutch, English, Finnish, French, German, Greek, Italian, Portuguese, Spanish (lexicon only) and Swedish. In addition, corpora of 20, 15 and 3 million words will be produced for Belgian-French, Irish and Norwegian respectively.
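The planned volumes can be tallied with a short sketch. It assumes that "lexicon only" means no corpus is produced for Spanish and that the three additional languages receive a corpus but no lexicon; the variable and function names are illustrative and not part of any PAROLE tool.

```python
# Planned PAROLE deliverables as stated in the text above.
# The grouping and all names below are illustrative assumptions.

# Languages that get both a corpus (at least 20 million words) and a lexicon.
FULL_LANGUAGES = [
    "Catalan", "Danish", "Dutch", "English", "Finnish", "French",
    "German", "Greek", "Italian", "Portuguese", "Swedish",
]
# Assumption: "lexicon only" means no corpus for this language.
LEXICON_ONLY = ["Spanish"]
# Additional corpora, sizes in millions of words.
CORPUS_ONLY_MILLIONS = {"Belgian-French": 20, "Irish": 15, "Norwegian": 3}

MIN_CORPUS_MILLIONS = 20      # lower bound per full-language corpus
LEMMAS_PER_LEXICON = 20_000


def planned_totals():
    """Return the minimum planned volumes across all languages."""
    lexicon_count = len(FULL_LANGUAGES) + len(LEXICON_ONLY)
    corpus_millions = (
        MIN_CORPUS_MILLIONS * len(FULL_LANGUAGES)
        + sum(CORPUS_ONLY_MILLIONS.values())
    )
    return {
        "corpora": len(FULL_LANGUAGES) + len(CORPUS_ONLY_MILLIONS),
        "min_corpus_words_millions": corpus_millions,
        "lexicons": lexicon_count,
        "lemmas": lexicon_count * LEMMAS_PER_LEXICON,
    }


if __name__ == "__main__":
    for key, value in planned_totals().items():
        print(f"{key}: {value}")
```

Under these assumptions the plan amounts to 14 corpora totalling at least 258 million words, and 12 lexicons totalling 240,000 lemmas.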
Permission for use has been obtained from the copyright holders (publishers, newspapers, etc.). The conversion of texts from the source format to the PAROLE format, which is based on the TEI/EAGLES CES, is proceeding according to schedule. On average, 5-6 million running words are already available per language. Semi-automatic tagging of part of the corpora is also underway.
All the information explicitly represented in the source texts is encoded essentially according to the CES (Corpus Encoding Standard), designed by EAGLES on the basis of the TEI guidelines. 250,000 running words will be tagged at the morpho-syntactic level, following the EAGLES guidelines as instantiated by each PAROLE partner for its own language.
Each partner uses a software package of its choice to construct, mark up and tag its corpus. The compatibility and interchangeability of the various corpora are ensured by the adoption of commonly defined criteria for composition, encoding and linguistic annotation.
The 20,000 lexical entries that will form the initial nucleus of the lexicons to be developed in the different countries have been selected. The morphological encoding is almost complete for all the languages. Encoding at the syntactic level has started, producing the initial SGML files ready to be loaded (imported) through a common filler into the common PAROLE lexical DB.
The PAROLE lexicon model is based on the results of LRE EAGLES and EUREKA GENELEX. As a result, all the lexical resources being developed are declarative, independent of theory and application, and multifunctional, and they will be able to evolve easily, for example to incorporate other levels of information or to become multilingual. This approach, which meets the requirements of genericity, explicitness and variable granularity, will guarantee large-scale reusability. The model, which offers a high level of descriptive precision, is designed so that application-dependent data models and application dictionaries can be derived from this repository of information by mapping the generic model onto the application model. The coverage is 20,000 entries per language, described at the morphological and syntactic levels and, in a few cases, at the semantic level.
The availability of rather large, uniformly structured lexical resources in all the languages mentioned above will offer the users the benefits of a standardised base.
The exchange format for the lexicons, as for the corpora, is SGML: all the lexicons share the same DTD for the morphological and syntactic layers. Moreover, the use of a common set of lexicon management tools is a guarantee that all lexicons will fully conform to the model. The use of these tools is a precondition of an industrial level of quality for the volumes of data (in so many languages) that PAROLE is to deliver.
The Way Ahead
The work to create lexicons and corpora is now proceeding at full pace. During 1997 the Consortium will continue the production of language resources (LR). The first drafts of guidelines for encoding corpora and lexicons (user manuals) will be prepared. These guidelines, together with the availability of data encoded according to EAGLES/TEI standards, will contribute concretely to the dissemination of these standards. The validation phase will also begin, in co-operation with ELRA.
All the lexicons will be publicly available, under conditions to be determined within the project. Each corpus will be accessible via the Internet. A subset of 3 million words of each corpus (including the tagged words) will also be 'distributable', i.e. a physical copy of it can be given to users. Co-operation with ELRA will be sought to this end. Restrictions on the type of usage will depend on the restrictions imposed by the copyright holders of the source texts when they authorised the inclusion of their texts in the corpus.
Funding Scheme: CSC - Cost-sharing contracts