The language industries of the future will rely heavily on the availability of large-scale language resources (e.g. corpora, speech databases, dictionaries, linguistic descriptions), together with appropriate standards and methodologies. Ready access to harmonised databases of language data and rules would not only provide a direct benefit to research and development efforts across a wide range of private and public organisations, but would also foster fruitful academic and industrial co-operation.
The project aims to define a broad organisational framework for creating the language resources (LRs) for written and spoken language engineering that are needed to develop an adequate language technology and industry in Europe. It also aims to determine the feasibility of creating a co-ordinated European network of repositories to store, disseminate and maintain such resources. This activity is intended to contribute to the long-term goal of making large-scale LRs widely available to European organisations involved in R&D and educational activities.
Approach and Methodology
The overall approach and the results which the project intends to achieve can be summarised as follows:
to create structured, publicly available catalogues of existing linguistic resources, using and extending the information already collected by various international and national survey initiatives;
to evaluate the present European situation, comparing what is available with the most urgent needs of the European R&D and teaching communities, and then to formulate recommendations for a concerted European action in the field of reusable resources for natural language and speech;
to discuss with the relevant actors (e.g. owners of resources, producers, private and public users, funding bodies, scientific and professional associations) the various aspects of the problem, their needs and requirements, the possible solutions, their willingness to co-operate, and the conditions for a joint European action;
to identify, describe and evaluate at various levels (e.g. organisational, technical, legal) alternative methods and structures which could ensure the creation, management and maintenance of a European repository of reusable LRs, and their dissemination to the various types of users;
to experiment with the collection and dissemination of existing LRs using (i) a distributed electronic network and (ii) CD-ROM pressing facilities, with the aim of encouraging the reuse of already available resources, and also of acquiring experience which will feed into the formulation of final recommendations;
to present final recommendations for establishing a collaborative infrastructure that will act as a collection, verification, management and dissemination centre, built on the foundation provided by existing European structures and organisations.
Assessing Existing Resources: carrying out a review of what LRs currently exist, both in Europe and elsewhere. The goal of this survey is not to produce a comprehensive, exhaustive catalogue of such resources, but rather to assess which needs of the various European languages are still not satisfied by the available resources, and to compare and characterise the situations of the different languages. The results of this evaluation effort will provide the basis for the general recommendations (see below).
Needs Analysis: determining the main resource needs of European actors involved in RTD, training and system development; discussing the various aspects (e.g. legal, financial and organisational problems; the participation and role of different types of public and private actors) of the actions required to meet the needs for LRs in Europe, as a basis for defining an overall organisational framework for the development of adequate LRs in Europe.
Experimental Implementation: testing the usefulness and feasibility of a distributed resource repository by implementing an infrastructure on which a set of LRs will be mounted. In particular, we will experiment with the dissemination of LRs using ELSNET's existing infrastructure for LRs: (i) a wide-area network running the AFS server software, and (ii) the formatting, mastering and distribution of data on CD-ROM.
Recommendations: making detailed recommendations for the creation, management, and maintenance of a distributed, managed repository of reusable LRs, based on a detailed analysis and evaluation of the alternatives.
Exploitation and Future Prospects
The goals of the project are the co-ordinated collection and distribution of LRs, the promotion of awareness of the need for widely available LRs, and the promotion of consensus on an overall European strategy. Consequently, dissemination activities are central to the project. The project consortium comprises representatives of major Europe-wide bodies and associations, most notably ELSNET, ESCA and EACL, and will be assisted by an industrial steering committee composed of representatives of leading IT companies, publishers, PTTs and other providers of electronic information services.
The action will be carried out in co-operation with relevant European groups and with ongoing initiatives such as EAGLES, and will involve, among other things, an analysis of existing international structures. It is expected that the experimental activities carried out within the project and the recommendations for further larger-scale operations will contribute to the establishment of a broad language infrastructure covering all Community languages.