
Reduction of Noise and Silence in Full Text Retrieval Systems for Legal Texts

Objective

The problems of accessing texts in a large textual database are not restricted to the legal world, although they are as acute here as in most other spheres of activity. Existing retrieval mechanisms depend on the user being able to formulate a query as strings of keywords included in an inverted list (a primitive index). This restricts use of the texts because it ignores some properties of the legal sublanguage.

The RENOS project aims to develop software modules capable of being integrated into existing Full Text Retrieval Systems (FTRS) which will reduce the levels of "noise" and "silence" of such systems when applied to legal texts. "Noise" is defined as the retrieval of texts of little or no relevance to user queries, while "silence" is defined as failing to retrieve relevant texts from the database. The software modules will implement a semi-automatic methodology for identifying legal terms (single-word and compound terms) in legal texts originating from several European member states by statistical means and by morphological and linguistic analysis.
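The two definitions above can be turned into simple ratios. The sketch below is only an illustration: the project description does not say how noise and silence are measured, so the formulas here (noise as the share of retrieved texts that are irrelevant, silence as the share of relevant texts that are missed) are an assumption, and the function name and document identifiers are hypothetical.

```python
def noise_and_silence(retrieved, relevant):
    """Return (noise, silence) for one query.

    Assumed definitions, not taken from the project text:
      noise   = irrelevant retrieved texts / all retrieved texts
      silence = relevant texts not retrieved / all relevant texts
    """
    retrieved, relevant = set(retrieved), set(relevant)
    # Guard against empty sets to avoid division by zero.
    noise = len(retrieved - relevant) / len(retrieved) if retrieved else 0.0
    silence = len(relevant - retrieved) / len(relevant) if relevant else 0.0
    return noise, silence


# Hypothetical query result: four texts returned, three texts actually relevant.
retrieved = {"doc1", "doc2", "doc3", "doc4"}
relevant = {"doc2", "doc4", "doc5"}
noise, silence = noise_and_silence(retrieved, relevant)
print(f"noise={noise:.2f} silence={silence:.2f}")
```

Under these assumed definitions, noise is the complement of precision and silence is the complement of recall, so reducing both corresponds to raising both standard retrieval measures.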

Approach and Methodology

The approach adopted in the project is the creation of an "intelligent inverted list", which comprises a lexicon of single-word and compound terms, a hierarchically arranged conceptual network and a constituent grammar. Lexicon entries will be linked to nodes in the network and these nodes - "concepts" - will form the basis of text retrieval. Constituent grammars will offer linguistic criteria for identifying compound terms and ambiguous terms, i.e. words used both as legal terms and in their general-language sense.

The lexicon will contain a framed representation of single-word and compound legal terms, which will be stored by their stems together with pointers to inflectional patterns. Nodes in the conceptual network will consist of semantic classes pertaining to legal terms - "concepts" - organized in a tree structure. Pointers from lexical entries to the concepts in the network will be established, synonymous terms pointing to the same node. The constituent grammar will contain rules for the identification of compound terms and disambiguation of the meaning (legal or general) of single word terms in context.
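The structure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the RENOS design: the lexicon entries, concept names, inflectional endings and sample texts are all hypothetical, compound terms and the constituent grammar are left out, and expanding a query to the descendants of a concept is one possible use of the tree that the project text does not confirm.

```python
from collections import defaultdict

# Hypothetical lexicon: stem -> (concept, permitted inflectional endings).
# Synonymous terms point to the same concept node.
LEXICON = {
    "tenant": ("LESSEE", ("", "s")),
    "lessee": ("LESSEE", ("", "s")),
    "landlord": ("LESSOR", ("", "s")),
    "lessor": ("LESSOR", ("", "s")),
}

# Hypothetical concept tree: concept -> parent concept (None for the root).
PARENT = {"LESSEE": "LEASE_PARTY", "LESSOR": "LEASE_PARTY", "LEASE_PARTY": None}


def concept_of(word):
    """Map a word form to its concept via stem + inflectional ending."""
    word = word.lower()
    for stem, (concept, endings) in LEXICON.items():
        if any(word == stem + ending for ending in endings):
            return concept
    return None


def build_index(docs):
    """Build a concept -> document ids index instead of keyword -> ids."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.split():
            concept = concept_of(token.strip(".,;:"))
            if concept:
                index[concept].add(doc_id)
    return index


def descendants(concept):
    """Return the concept together with every concept below it in the tree."""
    found = {concept}
    changed = True
    while changed:
        changed = False
        for child, parent in PARENT.items():
            if parent in found and child not in found:
                found.add(child)
                changed = True
    return found


def search(index, term_or_concept):
    """Resolve a term to its concept, then collect texts for that subtree."""
    concept = concept_of(term_or_concept) or term_or_concept
    result = set()
    for node in descendants(concept):
        result |= index.get(node, set())
    return result


DOCS = {
    "d1": "The tenant shall pay rent monthly.",
    "d2": "Lessees must insure the premises.",
    "d3": "The landlord may inspect the property.",
}
INDEX = build_index(DOCS)
print(sorted(search(INDEX, "lessee")))       # texts about the LESSEE concept
print(sorted(search(INDEX, "LEASE_PARTY")))  # texts about any party to a lease
```

In this sketch a query for "lessee" also finds the text that only mentions "tenant", which a plain keyword index would miss (silence), because both terms point to the same node.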

The components of the network will be manually built in the prototype system, following automatic extraction of an initial set of terms from a corpus of legal text containing legislation common to Community countries. Part of the software to be built will establish the links between network concepts (nodes) and the corpus by applying grammar rules on appropriate corpus segments. Another part will implement a mini Text Retrieval System using the Intelligent Inverted List, demonstrating its benefits over traditional methods of text retrieval. Evaluation stages will quantify the performance of the RENOS system with respect to existing FTRSs.

Exploitation and Future Prospects

The end result of the RENOS project will be a piece of software which, with some additional development work, may support a multilingual FTRS. The two private companies in the consortium, both legal information providers, plan to exploit this directly. Databank S.A. will explore the possibilities of incorporating tools and methodologies in the NOMOS database, and SOGEI will similarly attempt to integrate the conceptual legal term network into some of its existing products and services.

The collection of legal terms in three European languages is a key feature of the project, together with the evaluation and refinement of automated tools for the acquisition of terminological resources by statistical means. Extension to other languages and subject areas (engineering standards, medical texts) is envisaged.

Incorporation of the intelligent inverted list demonstrated in RENOS into existing FTRSs will greatly improve their query mechanisms, and the RENOS system could eventually be commercialized through direct sales to text retrieval companies and information providers.

Topic(s)

Data not available

Call for proposal

Data not available

Funding Scheme

Data not available

Coordinator

Databank S.A.
Address
124, Kifissias Ave. & Iatridou St
11526 Athens
Greece
 

Participants (5)

CEF Management Research Centre
Denmark
 
INTRASOFT S.A.
Greece
 
Institute for Language and Speech Processing (ILSP)
Greece
 
Istituto di Linguistica Computazionale
Italy
 
Società Generale d'Informatica SpA
Italy