Article Category

News

ERAUZTERM: tool for extracting lexical terms in Basque

The IXA group at the University of the Basque Country and the Elhuyar Foundation have developed a tool for extracting terms from Basque texts. The tool uses linguistic and statistical techniques and has been implemented with XML technology.

The Erauzterm project falls within the framework of the Hizking 21 strategic project, the objective of which is to encourage basic research in the field of language engineering and to develop for Basque the tools and resources that already exist for other languages. Basque is an agglutinative language, so morphosyntactic information has to be taken into account when choosing the candidate terms in a text. Moreover, terms in Basque are more dispersed than in more standardised languages.

Design and basics of the tool

The Erauzterm tool currently extracts terms that are noun phrases (nominal syntagmata). To select the most common and indicative noun phrase structures in Basque, the work previously carried out by the IXA group (Urizar et al., 2000) was used and extended with new patterns. To this end, a sample of 50,000 words was processed manually. The sample consists of 48 articles on IT published on Zientzia.net, the website of the Elhuyar Foundation. Terminologists extracted the terms from this sample and analysed their morphosyntactic structure. Apart from being used to determine the term patterns, this reference sample serves to assess the results of the Erauzterm extraction process.

The first step in the automatic extraction is to convert the texts to XML format. The raw XML corpus is then processed linguistically by means of the Euslem tagger, which annotates each word with its lemma and its morphosyntactic and inflectional categories. To detect the chains of text that match the term patterns, the patterns have been written as a grammar and compiled into a finite-state transducer. This transducer takes the corpus tagged by Euslem as input and detects the longest chains of text that match the morphosyntactic patterns. It then analyses the sub-phrases within those chains in order to extract the terms nested in them.
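The pattern-matching step described above can be illustrated with a minimal sketch. It is not the Erauzterm grammar or transducer: the tag names (`NOUN`, `ADJ`), the single pattern and the sample tokens are all assumptions chosen for illustration, and a regular expression over tag codes stands in for the finite-state transducer. It finds the longest chain that matches the pattern and then lists the candidates nested inside it.

```python
import re

# Hypothetical tag set and pattern, for illustration only.
# Each token is reduced to one letter: N = noun, A = adjective, X = other.
TAG_CODES = {"NOUN": "N", "ADJ": "A"}
# One or more nouns, optionally followed by an adjective.
PATTERN = re.compile(r"N+A?")


def tag_string(tagged):
    """Encode a list of (lemma, tag) pairs as a string of one-letter codes."""
    return "".join(TAG_CODES.get(tag, "X") for _, tag in tagged)


def longest_chains(tagged):
    """Return (start, end) token spans of the longest pattern matches."""
    return [m.span() for m in PATTERN.finditer(tag_string(tagged))]


def nested_candidates(tagged, span):
    """Return every sub-span of a chain that itself matches the pattern."""
    codes = tag_string(tagged)
    start, end = span
    found = []
    for i in range(start, end):
        for j in range(i + 1, end + 1):
            if PATTERN.fullmatch(codes, i, j):
                found.append(" ".join(lemma for lemma, _ in tagged[i:j]))
    return found


# Illustrative input: lemmas with assumed tags.
SAMPLE = [
    ("datu", "NOUN"),
    ("base", "NOUN"),
    ("erlazional", "ADJ"),
    ("erabili", "VERB"),
]

if __name__ == "__main__":
    for span in longest_chains(SAMPLE):
        print(span, nested_candidates(SAMPLE, span))
```

Because the regular expression is greedy, each match is the longest chain starting at that position; the nested loop then enumerates the shorter candidates contained in it, which a statistical filter would later rank.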
This process yields a list of candidate terms, which the statistical model then ranks and presents according to their likelihood of being terms. The candidates include both single-word and multi-word terms, and each kind is handled with different statistical techniques. For single-word terms, their relative frequency in the text is compared with their frequency in the language in general. For multi-word terms, the degree of association between the component words is analysed. The Erauzterm tool accepts texts in various formats. The user can view the context of each extracted term and can validate and export the term. To this end, the logical design of the tool is divided into three parts: the user interface, the processing logic and data management. In the physical design, a browser, a server and an XML database (Berkeley DB XML) have been used.

Conclusions

According to research undertaken in this field, it is not possible to fully optimise the coverage/precision relationship of this type of tool. In any case, in systems that offer the option of validating the terms manually, it makes more sense to aim for maximum coverage. The results obtained with Erauzterm give a coverage/precision ratio of 60/35. Coverage is the proportion of the terms present in the text that the tool extracts. Precision, on the other hand, is the proportion of the units proposed by the extractor as terms that are in fact correct terms. Both figures are necessary in order to evaluate the tool. The next stage in the project will be to improve this ratio, mainly the coverage. The main problems arising are: a) the treatment of terms which are not of Basque origin, b) the improvement of the analysis of terms that Euslem does not recognise (assignment of the lemma and morphosyntactic category, and disambiguation) and c) the treatment of postpositions.
Postpositions in particular give rise to many problems (noise), which is why the grammar of the system is being refined. The objectives set for the future include the treatment of term variants, extraction by means of machine learning and the establishment of conceptual relations between terms. Tools designed for other languages normally give poor results for Basque, so Erauzterm is a great advance in this respect. The tool was presented at the LREC 2004 and GLAT 2004 international conferences.
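The two evaluation figures discussed in the conclusions can be computed as in the following minimal sketch. It assumes that the manually extracted reference terms and the terms proposed by the extractor are both available as collections of strings. The function name and the sample terms are hypothetical and are not Erauzterm data.

```python
def coverage_and_precision(reference_terms, proposed_terms):
    """Compute coverage and precision for a term extractor.

    coverage  = correct proposed terms / terms present in the text
    precision = correct proposed terms / all units proposed as terms
    """
    reference = set(reference_terms)
    proposed = set(proposed_terms)
    correct = reference & proposed
    coverage = len(correct) / len(reference) if reference else 0.0
    precision = len(correct) / len(proposed) if proposed else 0.0
    return coverage, precision


# Hypothetical data: five reference terms, four proposals, three of them correct.
REFERENCE = ["datu base", "sistema eragile", "sare", "zerbitzari", "nabigatzaile"]
PROPOSED = ["datu base", "sare", "zerbitzari", "gauza"]

if __name__ == "__main__":
    coverage, precision = coverage_and_precision(REFERENCE, PROPOSED)
    print(f"coverage={coverage:.2f} precision={precision:.2f}")
```

When a terminologist validates every proposal by hand, a wrong proposal costs only a rejection, whereas a missed term is lost, which is the reasoning behind favouring coverage.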

Keywords

Language engineering

Countries

Spain