Semi-automatic indexing system for technical abstracts

Exploitable results

SISTA is designed for organizations that compile abstracts and run indexing services in technical subject areas. In particular, it supports indexers who assign controlled-language descriptors selected from a thesaurus. Beyond assigning descriptors to scientific abstracts, SISTA technology could be applied to other indexing situations and to the routing of texts such as news service items.

SISTA proposes possible descriptors for a document by analysing the text of its title and abstract, and provides a personal computer (PC)-based interface in which the indexer prepares the final list of index terms. The indexers, as users, may work in-house or externally to the database compiler.

SISTA uses natural language processing (NLP) techniques for the statistical and syntactic analysis of text in an existing corpus of documents. The analysis yields 'diagnostic units', and SISTA then determines the statistical association of these units with the indexing originally assigned to the documents. The resultant model is used to propose index terms for new texts.

Work on several corpora of abstracts showed that optimal SISTA performance depends on selecting a document representation and a descriptor assignment strategy appropriate to the corpus. Generally, a model using single association between diagnostic units and descriptors can exploit sophisticated representations such as noun groups better than a probabilistic model does.
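The idea of learning associations between diagnostic units and previously assigned descriptors, then using them to propose index terms, can be illustrated with a minimal sketch. This is not SISTA's implementation: the function names (`units`, `train`, `propose`), the toy corpus, the use of single lowercased words as diagnostic units, the co-occurrence ratio as the association measure, and the `threshold` value are all assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict

def units(text):
    # Assumption: single lowercased words stand in for 'diagnostic units';
    # the source mentions richer representations such as noun groups.
    return set(re.findall(r"[a-z]+", text.lower()))

def train(corpus):
    # corpus: list of (text, descriptors) pairs from already indexed documents.
    unit_df = Counter()          # number of documents containing each unit
    cooc = defaultdict(Counter)  # unit -> descriptor -> co-occurrence count
    for text, descriptors in corpus:
        for u in units(text):
            unit_df[u] += 1
            for d in descriptors:
                cooc[u][d] += 1
    # Association (assumed measure): share of the unit's documents
    # that were indexed with the descriptor.
    return {u: {d: n / unit_df[u] for d, n in ds.items()}
            for u, ds in cooc.items()}

def propose(model, text, threshold=0.6):
    # Score each descriptor by its strongest association with any unit
    # in the new text, then keep those at or above the threshold.
    scores = {}
    for u in units(text):
        for d, s in model.get(u, {}).items():
            scores[d] = max(scores.get(d, 0.0), s)
    kept = [d for d, s in scores.items() if s >= threshold]
    return sorted(kept, key=lambda d: (-scores[d], d))

# Hypothetical training corpus: (title text, assigned descriptors).
corpus = [
    ("Corrosion of steel pipes in seawater", ["corrosion", "steel"]),
    ("Steel alloys for turbine blades", ["steel", "turbines"]),
    ("Turbine blade fatigue testing", ["turbines", "fatigue"]),
]

model = train(corpus)
proposals = propose(model, "Corrosion in turbine housings")
```

In this sketch the proposals are only candidates; as in the workflow described above, an indexer would still review them and prepare the final list of index terms.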
