Speech Driven Multi-modal Automatic Directory Assistance

Recognition of a large vocabulary of proper names is a difficult task of very high perplexity. Moreover, practical applications require a low false automation rate, while in many cases a certain amount of false rejections can be tolerated. A suitable dialog strategy can substantially reduce the false automation rate if the Word Error Rate (WER) on proper name recognition is kept low. Research in the European project SMADA, involving systems deployed by telephone companies for very large directories in different countries, has shown that there is a need for an elaborate and effective decision strategy. Such a strategy should ensure that when a system makes a decision, the WER is sufficiently low, and that when an utterance is rejected, the probability that the recogniser would have output the right hypothesis is also low. In principle, a good strategy should evaluate an input with an initial set of ASR systems and produce an indication of acceptance or rejection. In each case, suitable new processes, which may involve specialized discriminative recognisers, should then be executed to refine the confidence in the decision until it is high enough to reach the phase known in computer transactions as commit. Specialized recognition processes use different acoustic features, knowledge sources, search algorithms, scores and models, and each process can make an optimal set of decisions according to a given decision theory. As decoders may use models of different precision, the decision strategy may consider combinations of hypothesis scores obtained by different decoders, but it can also reason about the ranking of decoder outputs and their performance statistics. Taking into account all possible distortions of proper names is impractical, because the lexicon is very large and its content is periodically and dynamically updated.
Based on the above considerations, a methodology is introduced for rescoring the N-best hypotheses generated, after a short dialogue, by a system developed at France Telecom R&D for the recognition of proper names pronounced in isolation and belonging to the whole French directory. France Telecom R&D has worked on the combination of various elementary confidence measures. A detailed analysis showed that the individual confidence measures behave differently with respect to the rejection of incorrect data on various field data subsets (substitution errors, out-of-vocabulary data and noise tokens) collected from a vocal directory task. No single confidence measure provided the best performance on every data subset. Hence, two combination methods were investigated: one combines confidence measures by means of a neural network and the other through logistic regression. Evaluations showed that both combination techniques are efficient, and that both take the best of the individual confidence measures involved on each data subset. This approach was reported at Eurospeech 2001. The successful deployment of a telephone speech-driven application relies not only on the accuracy of the recognition results, but also on their reliability. Reliable confidence measures are thus necessary in all practical applications to decide whether a recognized word, or sentence, should be accepted or rejected. As confidence measures play a crucial role in Directory Assistance (DA), they were investigated in this work package. Since most applications are based on continuous speech recognition controlled by grammars, Politecnico and Loquendo proposed an application-independent word confidence scoring technique that gives good performance across six different grammars that can be embedded in several applications.
Also, an original sentence-level acoustic likelihood ratio measure was proposed to detect ill-formed sentences that do not include Out Of Vocabulary words (quasi well-formed sentences), because such sentences cannot be easily rejected using word confidence measures only. Finally, a rejection strategy was devised that gave 96% rejection of ill-formed utterances and 92% rejection of quasi well-formed sentences, at the cost of a 5% rejection of in-grammar utterances. The research conducted by KUN focused on the impact of two different causes of ASR errors on the most suitable confidence measure:
- Confusion of acoustically similar names/words (e.g. names such as Maarn and Baarn differ only in one phonetic feature in the first sound). It appears that confidence measures that take only acoustic likelihood into account perform best in this situation.
- Problems caused by background noise or unclear articulation. For these problems, confidence measures based on the proportion of the probability mass of the first-best hypothesis relative to the competing hypotheses are most appropriate.
More information on the SMADA project can be found on the project's website: http://smada.rd.francetelecom.com/.
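The probability-mass confidence measure mentioned above can be illustrated with a minimal sketch. It assumes that the recogniser returns log-domain scores for an N-best list and that these can be normalised over the list; the function names and the 0.8 threshold are hypothetical and are not taken from the SMADA systems.

```python
import math


def first_best_mass(log_scores):
    """Share of the N-best probability mass held by the best hypothesis.

    `log_scores` are log-domain scores of the N-best hypotheses
    (an assumption; the actual SMADA score definition is not given).
    """
    if not log_scores:
        raise ValueError("N-best list is empty")
    top = max(log_scores)
    # Subtract the maximum before exponentiating to avoid underflow.
    weights = [math.exp(score - top) for score in log_scores]
    # The best hypothesis has weight exp(0) = 1.
    return 1.0 / sum(weights)


def decide(log_scores, threshold=0.8):
    """Accept the first-best hypothesis only if it dominates its competitors."""
    return "accept" if first_best_mass(log_scores) >= threshold else "reject"
```

With two equally scored hypotheses the measure is 0.5 and the utterance is rejected, while a single clearly dominant hypothesis yields a value close to 1. In practice the threshold would be tuned on field data to reach the desired trade-off between false automation and false rejection.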

The SMADA project aimed at improved functionality and usability of automated services that use automatic speech recognition (ASR) in their user interface, either as the only input/output modality (i.e. over the telephone) or as one of the modalities in multi-modal interfaces. The results of the Human Factors (HF) experiments in the Netherlands are described below. The HF experiments carried out by Katholieke Universiteit Nijmegen (KUN) focused on the usability of speech-driven multimodal interaction. KPN and KUN experimented with a combination of speech and pen input in applications where users must complete a form that requires textual input, for example for names. If the list of possible names is very large, browsing through a list is not attractive. Instead, one would like to be able to enter a name directly, either via speech or via some kind of soft keyboard if a full-fledged keyboard is not available. The most important results of the Human Factors experiments conducted by KPN and KUN can be summarised as follows:
- Multimodal applications are best designed from scratch. Deriving these applications from existing graphics-only or speech-only services tends to result in sub-optimal designs.
- ASR systems used in multimodal form-filling applications must have specific functionalities. It is especially important that the lexicon and language models can be adapted on-line, to cater optimally for the input that is expected for a specific field in the form.
- Uninformed users do not spontaneously understand the way in which they can combine speech and pen input. For multimodal services to develop rapidly, a large degree of standardisation of the interfaces for such services is highly desirable.
- Users prefer speech input over alternative input mechanisms if a full-fledged keyboard is not available. However, in case of persistent ASR errors they learn how to use alternative input methods to their advantage.
- Allowing users to select the first letter of a name through a soft keyboard and subsequently re-processing the last spoken input utterance appears to be a very powerful method for dealing with ASR errors.
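The first-letter recovery method in the last item can be sketched as follows. This is only an illustration under stated assumptions: the stored utterance is re-recognised against a lexicon restricted to names starting with the selected letter, `recognise` is a hypothetical callable standing in for the ASR engine, and `toy_recognise` with its made-up scores exists purely to make the example runnable.

```python
def restrict_lexicon(lexicon, first_letter):
    """Keep only the names that start with the letter picked on the soft keyboard."""
    letter = first_letter.casefold()
    return [name for name in lexicon if name.casefold().startswith(letter)]


def reprocess_last_utterance(utterance, lexicon, first_letter, recognise):
    """Re-run recognition of the stored utterance against the restricted lexicon.

    `recognise` is a hypothetical callable (utterance, lexicon) -> best name;
    it stands in for whatever ASR engine the application uses.
    """
    candidates = restrict_lexicon(lexicon, first_letter)
    if not candidates:
        return None  # no name matches the letter; the caller must fall back
    return recognise(utterance, candidates)


def toy_recognise(utterance, lexicon):
    """Toy stand-in: the 'utterance' is a dict of made-up acoustic scores."""
    return max(lexicon, key=lambda name: utterance.get(name, float("-inf")))
```

For example, if the toy scores favour Baarn over Maarn, selecting the letter "m" removes Baarn from the active lexicon, so re-processing the same utterance returns Maarn without the user having to speak again.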

The results of noise robustness experiments not directly pertaining to the ETSI Aurora proposal are summarised below:
- Denoising techniques: The performance of the Loquendo Automatic Speech Recognition system was evaluated with MFCCs, JRASTA Perceptual Linear Prediction Coefficients (JRASTA PLP), or energies from a Multi Resolution Analysis (MRA) tree of filters as input features to a hybrid system in which a Neural Network (NN) provides observation probabilities for a network of Hidden Markov Models (HMM). An investigation was performed on the use of denoising techniques in the time domain applied to the outputs of the filters corresponding to a Multi Resolution Analysis. The fact that energies of denoised samples are used for Automatic Speech Recognition (ASR) makes soft thresholding particularly attractive, especially if Principal Component Analysis (PCA) is applied to the whole tree of energy features. This consideration is supported by experimental results on a very large test set including many speakers uttering proper names from different locations of the Italian public telephone network. The results show that soft thresholding outperforms JRASTA PLP with a WER reduction, after denoising, of 26%. Other experiments in noisy conditions have shown a WER reduction of 15.7% when Signal-to-Noise Ratio dependent Spectral Subtraction (SS) is performed on MRA-PCA features rather than on JRASTA PLP features. Furthermore, SS appears to be better than soft thresholding, which still slightly outperforms denoising with JRASTA PLP features. In another study, the Loquendo hybrid ANN/HMM architecture for denoising and observation transformation was used to investigate the performance degradation that occurs when an ANN trained on phonetically balanced sentences in normal telephone conditions is used to recognize a specific small vocabulary in car noise conditions. Loquendo has filed a patent application for the new feature system.
- Noise robust features: KUN studied how the state-of-the-art representation of the spectral shape of a short-time spectrum of speech, i.e. Mel-frequency cepstral coefficients (MFCCs), should be computed for maximum recognition performance in noisy test conditions: should the MFCCs be derived from a short-time spectrum computed with a fast Fourier transform (FFT) or with a linear predictive coding (LPC) algorithm? This work was continued in 2002 to better identify the conditions under which one representation can be expected to be preferred over the other. Using the full Aurora2 database for these experiments, these conditions could be successfully identified. In addition, the results made it possible to quantify the more general (but, as a consequence, rather imprecise) rules about the preference for FFT-based or LPC-based MFCCs that were found in the literature. Finally, a detailed analysis of the data obtained in previous experiments with noise robust feature extraction showed that a linear transformation of noisy features is not always sufficient. Therefore, KUN studied non-linear transformations based on a technique known as Histogram Normalisation. The first experiments with Histogram Normalisation showed that recognition performance on the Aurora2 test set is determined to a large extent by the match of the energy feature between the training and test conditions. Histogram Normalisation achieves a match that is at least as good as what can be obtained with time-domain noise reduction. This work was further extended to include unprepared continuous speech utterances and was reported in a paper submitted to a journal. In the work on noise robustness, KUN found that instruments were lacking to study training-test mismatch, or model-data mismatch, in the ways that were deemed necessary for the research.
More precisely, an evaluation tool was deemed necessary that makes it possible to study the effects of model-data mismatch for individual feature vector components. To extend the instruments for studying model-data mismatch, software was developed that measures the overall mismatch between the distributions of acoustic features in a test set and the distributions seen during training, so that the mismatch can be quantified for each acoustic feature vector component individually.
- Noise reduction techniques: Work has also been done to make the front-end more robust against noise. Two noise reduction techniques were investigated: one working in the temporal domain (signal filtering) and the other in the spectral domain (noise subtraction). Moreover, long noisy silences are very harmful during the speech recognition process, especially if the current noise was not observed during the training phase (i.e. is not present in the training data). Hence, techniques were investigated for detecting and removing frames corresponding to non-speech (noise or silence) signal before running the speech decoder. These approaches appreciably improved the recognition results.
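Two of the operations named in this section, soft thresholding and Histogram Normalisation, can be illustrated with a small sketch. It makes several assumptions: soft thresholding is shown in its textbook form (shrink each sample towards zero by a fixed threshold), Histogram Normalisation is shown as a simple empirical quantile mapping of one feature component onto its training distribution, and all function names are hypothetical. The exact procedures used by Loquendo and KUN are not described here and may differ.

```python
import bisect
import math


def soft_threshold(samples, threshold):
    """Shrink each sample towards zero by `threshold`; small samples become zero."""
    return [math.copysign(max(abs(x) - threshold, 0.0), x) for x in samples]


def histogram_normalise(test_values, train_values):
    """Map each test value to the training value at the same empirical quantile.

    Both arguments hold the values of ONE feature component (for example
    the energy feature); each component would be processed separately.
    """
    if not test_values or not train_values:
        raise ValueError("both value lists must be non-empty")
    sorted_test = sorted(test_values)
    sorted_train = sorted(train_values)
    n_test, n_train = len(sorted_test), len(sorted_train)
    normalised = []
    for value in test_values:
        # Number of test values <= value, i.e. the empirical CDF times n_test.
        rank = bisect.bisect_right(sorted_test, value)
        # Integer ceiling of rank * n_train / n_test, converted to a 0-based index.
        index = (rank * n_train + n_test - 1) // n_test - 1
        normalised.append(sorted_train[index])
    return normalised
```

After the mapping, the test values follow the same empirical distribution as the training values, which is the kind of match between training and test conditions that the experiments above identify as important for the energy feature.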

The results of experiments on modelling user formulation variants include:
- Generating business entry variants: The analysis of the traffic has shown that about 80% of the Directory Assistance (DA) customer accesses are related to business listings. Thus, it is important to improve the success rate of the automatic system for this class of calls. Directory Assistance for business listings, however, is a challenging task: one of its main problems is that customers formulate their requests for the same listing with great variability. Since the content of the original records in the database does not typically match the linguistic expressions used by the callers, a complex processing step is needed to derive a set of possible formulation variants (FVs) from each original record in the listing book. A large percentage of user expressions, however, still remains uncovered by the FV database. Thus, we have proposed a procedure for detecting, from field data, user formulations that were not foreseen by the designers. These formulations can be added, as variants, to the denominations already included in the system to reduce its failures. Our approach is based on partitioning the field data into phonetically similar clusters from which new user formulations can be derived.
Our working hypothesis, confirmed by the experimental results, was that when a large number of requests for the same denomination is collected, there is a high probability of obtaining clusters of phonetically similar strings, characterized by high cardinality and small dispersion of the included strings. The central element of such a cluster, defined as the string that has the minimum sum of distances from all the other elements of the cluster, is a quite accurate phonetic transcription of a (possibly new) user formulation. During the project we collected tens of millions of phonetic strings referring to business listings routed to the operators because the automatic system was unable to terminate the transaction with the customer. Our procedure is able to filter a huge number of calls routed to the operators, and to detect a limited number of phonetic strings that can be inspected by human operators, easily transcribed orthographically, and associated with the corresponding phone number. This approach has been used to update the system vocabulary, giving a significant reduction of the system failures on a field test set.
- Dealing with lexical variants for proper name recognition: Recognition of a large vocabulary of proper names is a difficult task of very high perplexity. A suitable dialog strategy can substantially reduce the false automation rate if the Word Error Rate (WER) on proper name recognition is kept low. In principle, a good strategy should evaluate an input with an initial set of ASR systems and produce an indication of acceptance or rejection. In each case, suitable new processes, which may involve specialized discriminative recognisers, should then be executed to refine the confidence in the decision until it is high enough to reach the phase known in computer transactions as "commit".
Of particular interest for DA are decoders based on lexical models that account for distortions of a canonical pronunciation as they appear in surface phonetic representations of words. A search based on a network with all possible distortions of canonical forms may lead to an increase in word error rate, because the knowledge used includes a large number of distortion models that are inconsistent with the distortion types introduced by a given speaker. Based on the above considerations, a methodology has been introduced for rescoring the N-best hypotheses generated, after a short dialogue, by a system developed at France Telecom R&D for the recognition of proper names pronounced in isolation and belonging to the whole French directory. A blackboard-based architecture has been proposed for scheduling the execution of different recognition processes using different lexical models. Using this architecture, a consensus-based verification strategy has been developed and tested with a French directory of more than 100,000 entries. Results have shown much better performance than the use of posterior probability. A journal paper is in preparation on this topic.
- Generating pronunciation variants for city-name recognition: Recognizing city names is mandatory in many applications such as directory assistance, tourism information, etc. However, this task is quite difficult in France, as it implies a large vocabulary (40,000 city names). Furthermore, some names are short monosyllabic words, while others, such as long official compound names, are frequently abbreviated to shorter common names. Hence, rules were defined to automatically predict short abbreviated common names, in order to add those extra variants to the recognition vocabulary. The principle of this rule-based approach was published at ICASSP 2003.
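The central element and dispersion of a cluster of phonetic strings, as defined for the business entry variants above, can be sketched as follows. The sketch assumes that phonetic strings are represented as plain character strings and that the distance is the Levenshtein edit distance; the distance actually used in the project is not specified, and the function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (assumed distance measure)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def central_element(cluster):
    """String with the minimum sum of distances to all elements of the cluster."""
    if not cluster:
        raise ValueError("cluster is empty")
    return min(cluster, key=lambda s: sum(edit_distance(s, t) for t in cluster))


def dispersion(cluster):
    """Mean distance of the cluster elements from the central element."""
    centre = central_element(cluster)
    return sum(edit_distance(centre, t) for t in cluster) / len(cluster)
```

For a toy cluster such as ["farmasia", "farmacia", "farmasja", "farmasia"], the central element is "farmasia" and the dispersion is 0.5. Clusters with many members and a small dispersion would be the ones whose central elements are passed on for human inspection.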

In 1999, a standardisation effort started in ETSI on how speech recognition over mobile networks could be improved. The intention was to reduce the influence of noisy environments, bandwidth limitations, codec effects and transmission errors, all of which decrease recognition performance. A major leap was made by introducing distributed processing of the acoustic data, i.e. generating the feature vectors of the signal in the terminal and transmitting these vectors to a backend, which performs the remaining decoding in the recognition process. Distributed Speech Recognition (DSR) can overcome most of the restrictions mentioned above and, by using an excellent noise-reduction method, can improve the quality of the speech signal. SMADA has contributed to this standardisation through France Telecom and Alcatel by providing algorithms for noise robust feature extraction in a first evaluation round, where they achieved the best performance so far. For the second round the consortium teamed up with another ETSI member and together they developed the winning proposal for the noise-robust front-end. This reduces the recognition errors on a defined evaluation set by more than 50% compared with standard mel-cepstrum feature extraction. The result became a formal ETSI standard in October 2002 and is currently being discussed in 3GPP for implementation as a codec for speech-enabled services in 3G networks. This result of SMADA will have major commercial value in the near and mid-term future, when speech-enabled and multimodal services become available in 2.5G and 3G mobile networks. It is foreseeable that after 2007 new mobile phones will be equipped with these software modules, much as they are equipped with WAP browsers today. See: http://webapp.etsi.org/workprogram/Report_WorkItem.asp?WKI_ID=6402.
