Community Research and Development Information Service - CORDIS

Speech Driven Multi-modal Automatic Directory Assistance (SMADA): Acoustic confidence measures

Recognition of a large vocabulary of proper names is a difficult task with very high perplexity. Moreover, practical applications require a low false automation rate, while in many cases a certain number of false rejections can be tolerated. A suitable dialogue strategy can substantially reduce the false automation rate if the Word Error Rate (WER) on proper name recognition is kept low.

Research in the European project SMADA, which involved systems deployed by telephone companies for very large directories in different countries, has shown the need for an elaborate and effective decision strategy. Such a strategy must ensure that when a system makes a decision the WER is sufficiently low, and that when an utterance is rejected the probability that the recogniser would have output the right hypothesis is also low.
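The two requirements above can be checked on held-out data. The following is a minimal sketch, not the SMADA evaluation procedure: it assumes each utterance carries a single confidence score and a flag saying whether the recogniser's top hypothesis was correct, and the function name `decision_quality` and the threshold are hypothetical. For names pronounced in isolation, the error rate among accepted utterances plays the role of the WER.

```python
def decision_quality(scored, threshold):
    """Evaluate an accept/reject threshold on labelled data.

    scored: list of (confidence, is_correct) pairs, one per utterance.
    Returns (error_rate_among_accepted, correct_rate_among_rejected);
    a good decision strategy keeps both values low.
    """
    accepted = [ok for conf, ok in scored if conf >= threshold]
    rejected = [ok for conf, ok in scored if conf < threshold]
    # Share of accepted utterances whose hypothesis was wrong.
    err_accepted = (
        sum(1 for ok in accepted if not ok) / len(accepted) if accepted else 0.0
    )
    # Share of rejected utterances that the recogniser had actually got right.
    ok_rejected = (
        sum(1 for ok in rejected if ok) / len(rejected) if rejected else 0.0
    )
    return err_accepted, ok_rejected
```

Sweeping the threshold over a development set shows the trade-off between false automation and false rejection.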

In principle, a good strategy should evaluate an input with an initial set of ASR systems and produce an indication of acceptance or rejection. In each case, suitable further processes, which may involve specialized discriminative recognisers, should then be executed to refine the confidence in the decision until it is high enough to reach the phase known in transaction processing as commit.
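One way to picture such a staged strategy is the sketch below. It is a simplification under stated assumptions: each stage is a hypothetical callable that returns an updated confidence in the current hypothesis, and the thresholds `accept_at` and `reject_at` are illustrative values, not figures from the project.

```python
def staged_decision(utterance, stages, accept_at=0.95, reject_at=0.05):
    """Run recognition stages in order until the confidence is decisive.

    stages: list of callables; each takes (utterance, prior_confidence)
    and returns an updated confidence in [0, 1] for the current hypothesis.
    Returns (outcome, confidence, stages_used), where outcome is
    "commit", "reject" or "undecided".
    """
    confidence = 0.5  # assumed neutral starting point before any evidence
    for used, stage in enumerate(stages, start=1):
        confidence = stage(utterance, confidence)
        if confidence >= accept_at:
            return "commit", confidence, used
        if confidence <= reject_at:
            return "reject", confidence, used
    # No stage was decisive; a dialogue turn or a human operator could take over.
    return "undecided", confidence, len(stages)
```

Later stages run only when the earlier ones leave the decision open, so costly specialized recognisers are reserved for the difficult cases.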

Specialized recognition processes use different acoustic features, knowledge sources, search algorithms, scores and models, and each process can make an optimal set of decisions according to a given decision theory.

As decoders may use models of different precision, the decision strategy may consider combinations of hypothesis scores obtained by different decoders, but it can also reason about ranking of decoder outputs and their performance statistics.

Taking into account all possible distortions of proper names is impractical, because the lexicon is very large and its content is periodically and dynamically updated.

A methodology is introduced, based on the above considerations, for rescoring the N-best hypotheses generated, after a short dialogue, by a system developed at France Telecom R&D for the recognition of proper names pronounced in isolation and belonging to the whole French directory.

France Telecom R&D has worked on the combination of various elementary confidence measures. A detailed analysis showed that these measures behave differently with respect to the rejection of incorrect data on the various field data subsets (substitution errors, out-of-vocabulary data and noise tokens) collected from a vocal directory task: no single confidence measure provided the best performance on every subset. Hence, two combination methods were investigated. One combines confidence measures by means of a neural network, the other through logistic regression. Evaluations showed that both combination techniques are effective, and that both take the best of the individual confidence measures involved on each data subset. This approach was reported at Eurospeech 2001.
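The logistic regression variant can be sketched as follows. This is an illustrative toy in plain Python, not the France Telecom R&D implementation: the two input measures, the training data, the learning rate and the function names are all assumptions made for the example.

```python
import math


def sigmoid(z):
    """Numerically safe logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def train_logistic(features, labels, lr=0.5, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss.

    features: list of tuples, one elementary confidence measure per column.
    labels: 1 if the hypothesis was correct, 0 if it should be rejected.
    """
    n_feat = len(features[0])
    weights, bias = [0.0] * n_feat, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * n_feat, 0.0
        for x, y in zip(features, labels):
            err = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / len(features) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(features)
    return weights, bias


def combined_confidence(x, weights, bias):
    """Map a vector of elementary measures to one confidence in (0, 1)."""
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


# Toy data: two hypothetical elementary measures per utterance.
X = [(0.9, 0.8), (0.8, 0.3), (0.3, 0.9), (0.7, 0.7),
     (0.2, 0.1), (0.1, 0.4), (0.4, 0.2), (0.3, 0.3)]
y = [1, 1, 1, 1, 0, 0, 0, 0]
weights, bias = train_logistic(X, y)
```

In this toy set neither measure separates the two classes on its own, which mirrors the observation that no single measure was best on every data subset; the combined score is then compared with a single rejection threshold.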

The successful deployment of a telephone speech-driven application relies not only on the accuracy of the recognition results, but also on their reliability. Reliable confidence measures are thus necessary in all practical applications to decide whether a recognized word, or sentence, should be accepted or rejected. As confidence measures play a crucial role in directory assistance (DA), they were investigated in this work package.

Since most applications are based on continuous speech recognition controlled by grammars, Politecnico and Loquendo proposed an application-independent word confidence scoring technique that gives good performance across six different grammars that can be embedded in several applications.

In addition, an original sentence-level acoustic likelihood ratio measure was proposed to detect ill-formed sentences that do not include out-of-vocabulary words (“quasi” well-formed sentences), because such sentences cannot easily be rejected using word confidence measures alone. Finally, a rejection strategy was devised that rejected 96% of ill-formed utterances and 92% of “quasi” well-formed sentences, at the cost of rejecting 5% of in-grammar utterances.
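The text does not say how the sentence-level measure is computed. A common form of acoustic likelihood ratio, shown here purely as an assumption, compares the log-likelihood of the grammar-constrained decoding with that of an unconstrained decoding (for example a free phone loop), normalised by the number of frames. The function names and the threshold value are hypothetical.

```python
def sentence_llr(loglik_grammar, loglik_free, n_frames):
    """Per-frame log-likelihood ratio between constrained and free decoding.

    loglik_grammar: log-likelihood of the best path allowed by the grammar.
    loglik_free: log-likelihood of the best unconstrained path (assumed to
        come from a free phone loop or similar filler model).
    Values far below zero suggest the grammar fits the audio poorly.
    """
    if n_frames <= 0:
        raise ValueError("n_frames must be positive")
    return (loglik_grammar - loglik_free) / n_frames


def accept_sentence(loglik_grammar, loglik_free, n_frames, threshold=-0.5):
    """Accept the sentence when the normalised ratio clears the threshold."""
    return sentence_llr(loglik_grammar, loglik_free, n_frames) >= threshold
```

In practice the threshold would be tuned on development data to reach a target operating point, such as a tolerated rejection rate for in-grammar utterances.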

The research conducted by KUN focused on the impact of two different causes of ASR errors on the most suitable confidence measure:
- Confusion of acoustically similar names/words (e.g. names such as ‘Maarn’ and ‘Baarn’ differ only in one phonetic feature in the first sound). It appears that confidence measures that take only acoustic likelihood into account perform best in this situation.

- Problems caused by background noise or unclear articulation. For these problems confidence measures based on the proportion of the probability mass of the first-best hypothesis relative to the competing hypotheses are most appropriate.
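The second kind of measure can be illustrated with an N-best list. The sketch below is an assumption about one reasonable formulation, not KUN's exact measure: it treats the hypothesis log scores as unnormalised log probabilities and returns the share of the total mass held by the first-best hypothesis. The function name is hypothetical.

```python
import math


def first_best_mass(log_scores):
    """Share of probability mass held by the best hypothesis in an N-best list.

    log_scores: log-domain scores of the competing hypotheses, in any order.
    Returns a value in (0, 1]; values near 1 mean the best hypothesis
    dominates its competitors.
    """
    if not log_scores:
        raise ValueError("empty N-best list")
    top = max(log_scores)
    # Subtract the maximum before exponentiating for numerical stability;
    # the best hypothesis then contributes exactly exp(0) = 1.
    total = sum(math.exp(s - top) for s in log_scores)
    return 1.0 / total
```

A flat N-best list, as might be produced by noisy or unclear speech, gives a low value, while a single dominant hypothesis gives a value close to 1.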

More information on the SMADA project can be found on the project’s website:


Reported by

LIA - Laboratoire Informatique d’Avignon - CNRS
84911 Avignon