Multilingual information retrieval dialogue system

A research project was set up to develop a multilingual, multifunctional information retrieval system. The system is implemented in a two-stage approach: first, a system is trained on read speech using a special training program. Stochastic language models for spontaneous speech were created for the Slavic languages Czech, Slovak and Slovenian from a large corpus of 10 000 training sentences per language. The national recognizers produce N-best word sequences, rescored using a polygram language model, as input to the second processing step (linguistic analysis). The Czech recognizer was evaluated on microphone-quality speech and achieved a word accuracy of 74% to 86% for speaker-independent recognition. The linguistic analysis of user utterances is realized with a language-independent approach (keyword classification trees) and language-dependent substring parsers. The dialogue manager, or dialogue module, interprets the meaning of the input utterance and produces an appropriate answer or a clarifying question. Its input is the semantic interpretation of the utterance, represented in SIL and produced by the linguistic module. In the dialogue interpretation process, the semantic interpretations are matched against the dialogue model to decide the subsequent steps of cooperative user interaction within the structured dialogue model. With this partitioned interactional model, dialogue management is partly independent of both the language and the information service domain. Significant outcomes of the project are domain-dependent speech databases in digital form for each of the languages (e.g. the DOVLAS database for Czech). Because the transliteration and standard pronunciation of each utterance, as well as an automatically derived time alignment, are also available, the data can serve as a basis for bootstrapping a recognition module for any other application in these three languages. These data are made available to the European Speech Community on compact disc read-only memory (CD-ROM).
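
The N-best rescoring step mentioned above can be illustrated with a minimal sketch. It is not the project's implementation: the source does not specify the polygram model, so an add-one-smoothed bigram model stands in for it here, and all function names, the toy sentences, the acoustic scores and the `lm_weight` parameter are assumptions made for illustration only.

```python
import math
from collections import Counter

def train_bigram(sentences):
    """Count bigram contexts and bigrams over tokenised sentences."""
    contexts, bigrams = Counter(), Counter()
    for words in sentences:
        tokens = ["<s>"] + words + ["</s>"]
        contexts.update(tokens[:-1])                  # every token that has a successor
        bigrams.update(zip(tokens[:-1], tokens[1:]))
    vocab = {w for s in sentences for w in s} | {"</s>"}
    return contexts, bigrams, len(vocab)

def lm_logprob(words, model):
    """Add-one-smoothed bigram log-probability of one hypothesis.

    Out-of-vocabulary words are not handled specially in this sketch.
    """
    contexts, bigrams, vocab_size = model
    tokens = ["<s>"] + words + ["</s>"]
    return sum(
        math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
        for a, b in zip(tokens[:-1], tokens[1:])
    )

def rescore_nbest(nbest, model, lm_weight=1.0):
    """Re-rank (words, acoustic_logprob) pairs by acoustic + weighted LM score."""
    scored = [(ac + lm_weight * lm_logprob(words, model), words) for words, ac in nbest]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [words for _, words in scored]

# Toy training data and a two-entry N-best list (hypothetical values).
training = [
    "when does the train leave".split(),
    "when does the train arrive".split(),
    "the train leaves at noon".split(),
]
model = train_bigram(training)
nbest = [
    ("when does the rain leave".split(), -10.0),   # slightly better acoustic score
    ("when does the train leave".split(), -10.5),
]
best = rescore_nbest(nbest, model)[0]
```

In this toy case the language model score outweighs the small acoustic advantage of the first hypothesis, so the in-domain sentence is ranked first. A real system would typically use a higher-order or interpolated model and tune the language model weight on held-out data.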


