SHALLOW PARSING AND KNOWLEDGE EXTRACTION FOR LANGUAGE ENGINEERING

Objective

SPARKLE will address the increasing need for EU citizens to access and exchange knowledge in several languages by producing pilot applications in multilingual information retrieval and speech dialogue. It will do so through the development of flexible tools for semi-automatic induction of linguistic knowledge. Analysis of naturally occurring text will give fast, sound access to information in a number of languages for a variety of extraction, indexing and retrieval applications. Semi-automatic lexical acquisition will allow cost-effective creation of lexicons rich enough for use in a range of language engineering applications.
Progress
Project objectives presuppose the solution of at least the following three problems:
- appropriate segmentation of text into syntactically parseable units;
- selection of the unique semantically and pragmatically correct analysis from the potentially large number of syntactically legitimate ones possible; and
- undergeneration, i.e. dealing with cases of input outside the systems' lexical knowledge or syntactic coverage.
In the SPARKLE perspective, initial 'shallow' parsers are developed as a component of a lexical acquisition system which will be used to go through large text corpora, track down and acquire relevant lexical knowledge and construct dictionaries. Shallow parsing is seen as an intermediate step geared to lexical acquisition.
The project took a number of new approaches in order to significantly extend the work carried out so far in the area of robust parsing. A variety of software modules have been implemented, including: name and number recognizers; 'taggers', which assign a part of speech to each word; 'chunkers', which segment a text into a sequence of smaller organised text units; stochastic parsers, which rank multiple analyses according to the probability of their occurrence in natural texts; and augmented phrase structure parsers.
An evaluation scheme for parsers has also been defined and work has begun on the automatic learning of grammatical rules in the face of parsing failure.
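To make the 'chunker' role described above more concrete, the following is a minimal sketch of how part-of-speech-tagged input might be segmented into smaller units. The tag names (DET, ADJ, NOUN, VERB), the chunk labels (NC, O) and the grouping rule are assumptions made purely for illustration; they are not the SPARKLE tag set or chunking algorithm.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag)

# Hypothetical tag set and grouping rule, chosen only for illustration.
NOUN_CHUNK_TAGS = {"DET", "ADJ", "NOUN"}


def chunk(tagged: List[Token]) -> List[Tuple[str, List[str]]]:
    """Group maximal runs of DET/ADJ/NOUN tokens into 'NC' chunks.

    Every other token becomes a single-token chunk labelled 'O'.
    """
    chunks: List[Tuple[str, List[str]]] = []
    current: List[str] = []
    for word, tag in tagged:
        if tag in NOUN_CHUNK_TAGS:
            current.append(word)
        else:
            if current:
                # Close the noun chunk that was being built.
                chunks.append(("NC", current))
                current = []
            chunks.append(("O", [word]))
    if current:
        chunks.append(("NC", current))
    return chunks


# Toy tagged sentence, as a tagger might produce it.
tagged_sentence = [
    ("the", "DET"), ("parser", "NOUN"), ("reads", "VERB"),
    ("large", "ADJ"), ("corpora", "NOUN"),
]
print(chunk(tagged_sentence))
```

A real chunker would work from a much richer tag set and language-specific rules; the point here is only the division of labour between tagging and chunking.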
Specifications for Shallow Parsing
For evaluation purposes, a multi-level syntactic mark-up scheme has been developed, based on the results of the EAGLES Text Corpora Working Group. The scheme meets the following general desiderata, with a view to both lexical acquisition and other applications:
- the syntactic mark-up scheme should only contain relevant information;
- the syntactic mark-up should only include what a parser can reliably output;
- the syntactic mark-up should be amenable to language-specific parameterisation.
Given the bootstrapping approach to parser lexicalisation adopted within the project, three different annotation levels were identified: chunking; phrasal constituency; and dependency.
Each level is geared to a specific task, either as input to lexical acquisition or in evaluating the parser.
Specifications for Lexicon Structure
The lexical information which SPARKLE 'scours' in text corpora is expected to be couched in computational lexicons in a fairly straightforward way and then be used to enhance performance of lexicalised versions of related parsers. Diversity in both form and content of the lexical information contained in existing lexicons at each SPARKLE site made it impractical to develop a single SPARKLE lexicon. Nonetheless, it is convenient that the site lexicons share a common base to allow comparison between different methodologies of knowledge acquisition in different languages. The SPARKLE description language for lexicon encoding (S-DL) was designed to fulfil these purposes, and can also be seen as a first step in the direction of multi-lingual lexicon compilation.
Acquisition of Lexical Knowledge
In Pisa, work has concentrated on developing a Case-Based Reasoning approach to extracting lexical information from machine-readable dictionaries. Stuttgart and Cambridge have demonstrated systems for the automatic acquisition of sub-categorisation information. In both cases, the results rival the best results reported for automatic systems in terms of precision and recall.
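Precision and recall for acquired sub-categorisation information can be computed by comparing the acquired entries with a hand-built gold standard. The sketch below assumes that each entry is a (verb, frame) pair; the frame labels and the toy data are invented for illustration and do not reproduce any SPARKLE result.

```python
from typing import Set, Tuple

Frame = Tuple[str, str]  # (verb, sub-categorisation frame label)


def precision_recall(acquired: Set[Frame], gold: Set[Frame]) -> Tuple[float, float]:
    """Return (precision, recall) of acquired entries against a gold standard."""
    correct = len(acquired & gold)
    # Precision: share of acquired entries that are in the gold standard.
    precision = correct / len(acquired) if acquired else 0.0
    # Recall: share of gold-standard entries that were acquired.
    recall = correct / len(gold) if gold else 0.0
    return precision, recall


# Invented toy data; the frame labels are hypothetical.
gold = {("give", "NP_NP"), ("give", "NP_PP"), ("sleep", "INTRANS"), ("say", "SCOMP")}
acquired = {("give", "NP_NP"), ("sleep", "INTRANS"), ("sleep", "NP")}

p, r = precision_recall(acquired, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

Here two of the three acquired entries are correct, and two of the four gold entries are found.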
Multi-lingual Information Retrieval (MIR)
Rank Xerox European Research Centre (RXRC) is interested in finding ways of seamlessly accessing information across languages. As an application of the results of the research partners, RXRC intends to produce a demonstrator. This will accept queries in one language in order to access documents written in another language.
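One simple way to picture cross-language access is dictionary-based query translation followed by term matching. The sketch below is an assumption-laden illustration, not the RXRC demonstrator: the bilingual dictionary, the documents and the function names are all hypothetical.

```python
from typing import Dict, List, Set

# Toy bilingual dictionary (English -> French), invented for illustration.
BILINGUAL: Dict[str, Set[str]] = {
    "parser": {"analyseur"},
    "lexicon": {"lexique", "dictionnaire"},
    "speech": {"parole"},
}

# Toy document collection in the target language.
DOCUMENTS: Dict[str, str] = {
    "doc1": "un analyseur syntaxique robuste",
    "doc2": "construction d'un lexique à partir de corpus",
    "doc3": "reconnaissance de la parole",
}


def translate_query(query: str) -> Set[str]:
    """Replace each query word by all its dictionary translations."""
    terms: Set[str] = set()
    for word in query.lower().split():
        terms |= BILINGUAL.get(word, set())
    return terms


def retrieve(query: str) -> List[str]:
    """Rank documents by the number of translated query terms they contain."""
    terms = translate_query(query)
    scored = []
    for doc_id, text in DOCUMENTS.items():
        overlap = len(terms & set(text.lower().split()))
        if overlap:
            scored.append((overlap, doc_id))
    # Highest overlap first; ties broken by document id.
    return [doc_id for _, doc_id in sorted(scored, key=lambda s: (-s[0], s[1]))]


print(retrieve("lexicon parser"))
```

Keeping every translation of an ambiguous word, as done here, favours recall over precision; selecting among translation candidates is where lexical disambiguation would come in.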
Sharp Laboratories of Europe
Sharp Laboratories will build a system which uses shallow parsing and lexical disambiguation to enhance translation and retrieval tasks. During the first year of the project, SLE's contribution has focused on the specification of linguistic data structures for the parsers and the investigation of word disambiguation techniques. Preliminary results on disambiguation include prototype implementations for the acquisition of cooccurrence restrictions from bracketed corpora, and the automatic selection of translation candidates.
Speech Recognition
The Daimler Benz speech recognition system uses a grammar containing both morpho-syntactic and semantic information. The role of the parser is to find the best scored path covering the whole utterance in a graph of acoustically scored word hypotheses generated by the word recognition module of the dialogue system, where the sequence of words is interpretable according to the grammar and the dialogue context.
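The search for a best scored path through a graph of word hypotheses can be sketched as dynamic programming over a directed acyclic graph. The example below makes several assumptions that are not taken from the Daimler Benz system: nodes are numbered in time order, scores are log-probabilities (higher is better), and the grammar and dialogue-context check is left out entirely, so only the acoustic scores decide.

```python
from typing import Dict, List, Optional, Tuple

Edge = Tuple[int, int, str, float]  # (start node, end node, word, acoustic score)


def best_path(edges: List[Edge], n_nodes: int) -> Optional[Tuple[float, List[str]]]:
    """Return the highest-scoring word sequence from node 0 to node n_nodes - 1.

    Assumes every edge has start < end, so visiting edges in order of their
    start node guarantees that the best score for a node is final before any
    edge leaving that node is processed.
    """
    best: Dict[int, Tuple[float, List[str]]] = {0: (0.0, [])}
    for start, end, word, score in sorted(edges):
        if start not in best:
            continue  # start node is unreachable from node 0
        total = best[start][0] + score
        if end not in best or total > best[end][0]:
            best[end] = (total, best[start][1] + [word])
    # None means no path covers the whole utterance.
    return best.get(n_nodes - 1)


# Invented word graph with four nodes (0 = utterance start, 3 = utterance end).
edges = [
    (0, 1, "show", -1.0),
    (0, 1, "so", -1.5),
    (1, 2, "me", -0.5),
    (1, 3, "meetings", -2.5),
    (2, 3, "trains", -0.7),
    (2, 3, "drains", -1.2),
]
result = best_path(edges, 4)
print(result)
```

In a system of the kind described above, paths whose word sequence the grammar cannot interpret would additionally be ruled out during the search.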
In order to cope with the special problems of speech recognition and understanding, a lexico-grammar is required that must not over-generate, even if not all possible uttered sentences in the dialogue application can be analysed.
A lexicon compiler will be built to transform standardised EAGLES lexicons into the format required by the speech dialogue system.
The Way Ahead
The focus of future work will be the lexical acquisition phase of SPARKLE. In keeping with SPARKLE's practice of evaluating diverse approaches, the range of lexicon acquisition methods includes fully automatic ones, exploitation of existing resources, and traditional, 'manual' means. In general, all partners expect results from the lexicon acquisition phase to feed back into the first generation parsers, improving their performance and capabilities. At year's end, the acquisition prototypes will be jointly evaluated, using the metrics of lexicon evaluation developed in 1996. In addition, the RXRC and SLE prototypes will be integrated with some of the tools produced, such as the shallow parsers, to generate large-scale lexical resources. These resources will be evaluated using an adapted version of in-house translation technology. This work will lead to the development of the translation component for the final MIR system.
Advancements in economic integration are now progressively characterizing the European community as a Multilingual Information Society in which full participation increasingly relies on accurate and immediate access, consumption, exchange and dissemination of knowledge in a variety of languages.

The development of language models for real-world NLP applications requires flexible tools for semi-automatic induction of linguistic knowledge from text corpora. The SPARKLE consortium plans to satisfy this requirement through the achievement of the following goals.
First, software tools that can be easily parameterised by language will be developed to produce a phrasal-level syntactic analysis of naturally occurring free text.
The second goal of SPARKLE is to develop a lexical acquisition system capable of learning the aspects of word knowledge from free text which are needed for language engineering applications. The creation of such tools will make it possible to build sufficiently rich NLP lexicons in a cost-effective manner.

The parsers and lexicons produced in the project will be used by the industrial partners to build pilot applications in the areas of multilingual information retrieval and speech dialogue.

Progress and results

The project will produce reports detailing the results of major tasks. Software produced will include phrasal parsers and grammars, and a semantic lexicon acquisition tool.

Exploitation

Industrial applications will include machine-aided translation tools for information retrieval services, multilingual information retrieval and speech dialogue systems.

Funding Scheme

CSC - Cost-sharing contracts

Coordinator

Università degli Studi di Pisa
Address
Via Della Faggiola 32
56100 Pisa
Italy

Participants (5)

DAIMLER BENZ AKTIENGESELLSCHAFT
Germany
Address
225, Epplestrasse
70546 Stuttgart
RANK XEROX RESEARCH CENTRE SA
France
Address
6, Chemin De Maupertuis
38240 Meylan
Sharp Laboratories of Europe Ltd
United Kingdom
Address
Edmund Halley Road Oxford Science Park
OX4 4GA Oxford
THE CHANCELLOR, MASTERS AND SCHOLARS OF THE UNIVERSITY OF CAMBRIDGE
United Kingdom
Address
Pembroke Street, New Museums Site
CB2 3QG Cambridge
UNIVERSITAET STUTTGART
Germany
Address
Keplerstrasse 7
70174 Stuttgart