Community Research and Development Information Service - CORDIS

Speech Driven Multi-modal Automatic Directory Assistance (SMADA): Noise robustness experiments

Results of the noise robustness experiments that do not directly pertain to the ETSI Aurora proposal:

- Denoising techniques:
The performance of the Loquendo Automatic Speech Recognition system was evaluated with three alternative sets of input features: Mel-frequency cepstral coefficients (MFCCs), JRASTA Perceptual Linear Prediction coefficients (JRASTA PLP), or energies from a Multi Resolution Analysis (MRA) tree of filters. The features are fed to a hybrid system in which a Neural Network (NN) provides observation probabilities for a network of Hidden Markov Models (HMMs).

An investigation was performed on the use of denoising techniques in the time domain applied to the outputs of filters corresponding to a Multi Resolution Analysis. The fact that energies of denoised samples are used for Automatic Speech Recognition (ASR) makes soft thresholding particularly attractive, especially if Principal Component Analysis (PCA) is applied to the whole tree of energy features. This consideration is supported by experimental results on a very large test set including many speakers uttering proper names from different locations of the Italian public telephone network.

The results show that soft thresholding outperforms JRASTA PLP, with a word error rate (WER) reduction, after denoising, of 26%.
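As a minimal sketch of the soft-thresholding idea, assuming a single filter output and a hand-picked threshold (the threshold selection rule used in the project is not described here): every sample is shrunk towards zero by the threshold, samples smaller than the threshold are set to zero, and the band energy is then computed from the denoised samples. The function names and the sample values are illustrative only.

```python
import math

def soft_threshold(samples, threshold):
    """Shrink each sample towards zero by `threshold`; zero anything smaller."""
    denoised = []
    for x in samples:
        magnitude = abs(x) - threshold
        # Keep the sign of the original sample, reduce its magnitude.
        denoised.append(math.copysign(magnitude, x) if magnitude > 0 else 0.0)
    return denoised

def band_energy(samples):
    """Energy of one filter output, the quantity used as a recognition feature."""
    return sum(x * x for x in samples)

# Hypothetical output of one MRA filter: small values are mostly noise.
noisy_band = [0.05, -0.02, 0.90, -1.10, 0.03, 0.60]
denoised = soft_threshold(noisy_band, threshold=0.1)
energy = band_energy(denoised)
```

Because the shrinkage removes the threshold from the large samples as well, the noise contribution to the energy feature is reduced in every band, not only in the bands that fall below the threshold.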

Other experiments in noisy conditions have shown a WER reduction of 15.7% when Signal-to-Noise Ratio dependent Spectral Subtraction (SS) is performed on MRA-PCA features compared to when it is performed on JRASTA PLP features. Furthermore, SS appears to be better than Soft Thresholding, which still slightly outperforms denoising with JRASTA PLP features.
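The following sketch illustrates what SNR-dependent spectral subtraction can look like for one frame of a power spectrum. It is an assumption-laden illustration, not the project's implementation: the linear rule that maps SNR to an over-subtraction factor, the spectral floor, and all names are chosen here for the example, and the noise power spectrum is assumed to have been estimated beforehand (for instance from non-speech frames).

```python
import math

def snr_db(power_frame, noise_power):
    """Frame-level SNR estimate in dB from power spectra (guarded against zeros)."""
    signal = max(sum(power_frame), 1e-12)
    noise = max(sum(noise_power), 1e-12)
    return 10.0 * math.log10(signal / noise)

def oversubtraction_factor(snr):
    """Assumed linear rule: subtract more noise at low SNR, less at high SNR."""
    clipped = min(max(snr, -5.0), 20.0)
    return 4.0 - 0.15 * clipped  # 4.75 at -5 dB, 1.0 at 20 dB

def spectral_subtract(power_frame, noise_power, floor=0.01):
    """Subtract the scaled noise estimate per bin, never going below a floor."""
    alpha = oversubtraction_factor(snr_db(power_frame, noise_power))
    return [max(p - alpha * n, floor * n)
            for p, n in zip(power_frame, noise_power)]

# Hypothetical 4-bin frame: one speech-dominated bin, three noise-only bins.
frame = [10.0, 1.0, 1.0, 1.0]
noise = [1.0, 1.0, 1.0, 1.0]
cleaned = spectral_subtract(frame, noise)
```

The spectral floor keeps noise-only bins at a small positive value rather than zero or a negative number, which matters when log energies are computed afterwards.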

In another study, the Loquendo hybrid ANN/HMM architecture for denoising and observation transformation was used to investigate performance degradation when an ANN trained on balanced phonetic sentences and in normal telephone conditions is used to recognize a specific small vocabulary in a car noise condition.

Loquendo has filed a patent request for the new feature system.

- Noise robust features:
KUN studied how Mel-frequency cepstral coefficients (MFCCs), the state-of-the-art representation of the spectral shape of a short-time speech spectrum, should be computed for maximum recognition performance in noisy test conditions: should the short-time spectrum from which the MFCCs are derived be computed with a fast Fourier transform (FFT) or with a linear predictive coding (LPC) algorithm? This work was continued in 2002 using the full Aurora2 database, and the conditions under which one representation can be expected to be preferred over the other were successfully identified. In addition, the results made it possible to quantify the more general (but, as a consequence, rather imprecise) rules about the preference for FFT-based or LPC-based MFCCs found in the literature.
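To make the FFT versus LPC distinction concrete, the sketch below computes the short-time power spectrum of one frame in both ways. It is only an illustration under stated assumptions: a plain DFT stands in for the FFT, the LPC coefficients come from the autocorrelation method with the Levinson-Durbin recursion, windowing is omitted, and the mel filter bank, logarithm and cosine transform that turn either spectrum into MFCCs are left out. All names, the frame contents and the LPC order are hypothetical.

```python
import cmath
import math

def autocorrelation(frame, max_lag):
    return [sum(frame[n] * frame[n - lag] for n in range(lag, len(frame)))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return LPC polynomial a[0..order] (a[0] = 1) and the prediction error."""
    a = [1.0] + [0.0] * order
    error = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / error  # reflection coefficient
        updated = a[:]
        for j in range(1, i):
            updated[j] = a[j] + k * a[i - j]
        updated[i] = k
        a = updated
        error *= 1.0 - k * k
    return a, error

def dft_power_spectrum(frame, n_bins):
    """Power spectrum taken directly from the samples (what an FFT computes)."""
    n = len(frame)
    return [abs(sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))) ** 2
            for k in range(n_bins)]

def lpc_power_spectrum(frame, order, n_bins):
    """Smooth all-pole envelope: error / |A(e^{jw})|^2 on the same frequency grid."""
    a, error = levinson_durbin(autocorrelation(frame, order), order)
    n = len(frame)
    spectrum = []
    for k in range(n_bins):
        w = 2.0 * math.pi * k / n
        a_w = sum(a[j] * cmath.exp(-1j * w * j) for j in range(order + 1))
        spectrum.append(error / abs(a_w) ** 2)
    return spectrum

# Hypothetical 32-sample frame with two sinusoidal components.
frame = [math.sin(2 * math.pi * 3 * t / 32) + 0.5 * math.sin(2 * math.pi * 7 * t / 32)
         for t in range(32)]
fft_spec = dft_power_spectrum(frame, 17)
lpc_spec = lpc_power_spectrum(frame, order=4, n_bins=17)
```

The DFT spectrum retains all fine detail of the frame, noise included, whereas the LPC spectrum is a smooth envelope whose shape depends on the chosen order; which of the two serves recognition better in a given noise condition is the question the study addressed.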

Finally, from a detailed analysis of the data obtained in previous experiments with noise robust feature extraction, it appeared that linear transformation of noisy features is not always sufficient. Therefore, KUN studied non-linear transformations, based on a technique that is known as Histogram Normalisation. The first experiments with Histogram Normalisation showed that recognition performance on the Aurora2 test set is determined to a large extent by the match of the energy feature in the training and test conditions.

Histogram Normalisation effects a match that is at least as good as what can be obtained with time-domain noise reduction.
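A minimal sketch of the idea behind Histogram Normalisation, assuming it is applied to one feature component at a time (for example the energy feature) and implemented as plain quantile mapping: each test value is replaced by the training value that occupies the same position in the cumulative distribution. The function name and the numbers are illustrative, and practical systems typically use smoothed histograms over many frames rather than raw sorted samples.

```python
import bisect

def histogram_normalise(test_values, train_values):
    """Map test values onto the training distribution by matching quantiles."""
    train_sorted = sorted(train_values)
    test_sorted = sorted(test_values)
    n_test, n_train = len(test_sorted), len(train_sorted)
    normalised = []
    for v in test_values:
        # Empirical CDF of v within the test data (mid-rank, strictly in (0, 1)).
        lo = bisect.bisect_left(test_sorted, v)
        hi = bisect.bisect_right(test_sorted, v)
        cdf = (lo + hi) / (2.0 * n_test)
        # Inverse CDF of the training data: pick the matching quantile.
        idx = min(int(cdf * n_train), n_train - 1)
        normalised.append(train_sorted[idx])
    return normalised

# Hypothetical energy feature: the test values are a non-linear distortion
# of the training range, which no single linear transform would undo.
train_energy = [0.0, 1.0, 2.0, 3.0]
test_energy = [100.0, 400.0, 225.0, 900.0]
matched = histogram_normalise(test_energy, train_energy)
```

Because only the rank of each value within the test data matters, the mapping compensates for any monotonic distortion of the feature, linear or not.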

This work was further extended to include unprepared continuous speech utterances and was reported in a paper submitted to a journal.

In its work on noise robustness, KUN found that it lacked the instruments needed to study training-test mismatch, or model-data mismatch, in the ways the research required. More precisely, an evaluation tool was needed that makes it possible to study the effects of model-data mismatch for individual feature vector components. Software was therefore developed that measures the overall mismatch between the distributions of acoustic features in a test set and the distributions seen during training, so that the mismatch can be quantified for each acoustic feature vector component individually.
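The kind of per-component measurement described above can be sketched as follows. The mismatch measure used by the KUN software is not specified here, so this example assumes a simple one: for each feature vector component, training and test values are binned into histograms over a shared range, and the total variation distance between the two histograms is reported (0 for identical distributions, 1 for disjoint ones). All names and data are hypothetical.

```python
def histogram(values, lo, hi, n_bins):
    """Normalised histogram of `values` over [lo, hi] with equal-width bins."""
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for v in values:
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = float(len(values))
    return [c / total for c in counts]

def component_mismatch(train_vectors, test_vectors, n_bins=10):
    """Total variation distance between train and test, one score per component."""
    scores = []
    for k in range(len(train_vectors[0])):
        train_k = [vec[k] for vec in train_vectors]
        test_k = [vec[k] for vec in test_vectors]
        lo = min(min(train_k), min(test_k))
        hi = max(max(train_k), max(test_k))
        if hi == lo:  # constant component: no mismatch to measure
            scores.append(0.0)
            continue
        p = histogram(train_k, lo, hi, n_bins)
        q = histogram(test_k, lo, hi, n_bins)
        scores.append(0.5 * sum(abs(a - b) for a, b in zip(p, q)))
    return scores

# Two-component feature vectors: component 0 matches, component 1 is shifted.
train = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
test = [[0.0, 11.0], [1.0, 12.0], [2.0, 13.0], [3.0, 14.0]]
scores = component_mismatch(train, test)
```

A per-component score of this kind shows at a glance which features (for instance the energy feature) carry most of the training-test mismatch.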

- Noise reduction techniques:
Work has also been done to make the front-end more robust against noise. Several noise reduction techniques were investigated: one working in the temporal domain (signal filtering) and another in the spectral domain (noise subtraction).

Moreover, long noisy silences are very harmful during the speech recognition process, especially if the current noise was not observed during the training phase (i.e. not present in the training data). Hence techniques were investigated for detecting and removing frames corresponding to non-speech (noise or silence) signal before running the speech decoder.

These approaches appreciably improved recognition results.
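A minimal sketch of non-speech frame removal, assuming a simple energy-based detector rather than the technique actually used in the project: the quietest frame in the utterance is taken as the noise floor, and only frames whose log energy exceeds that floor by a fixed margin are passed on to the decoder. The margin, the function names and the sample frames are illustrative, and the approach presumes that the recording contains at least one non-speech frame.

```python
import math

def frame_log_energy(frame):
    """Mean energy of a frame in dB (small offset avoids log of zero)."""
    energy = sum(x * x for x in frame) / len(frame)
    return 10.0 * math.log10(energy + 1e-12)

def drop_non_speech(frames, margin_db=10.0):
    """Keep only frames that are at least `margin_db` above the quietest frame."""
    energies = [frame_log_energy(f) for f in frames]
    noise_floor = min(energies)
    return [f for f, e in zip(frames, energies) if e > noise_floor + margin_db]

# Hypothetical utterance: low-level noise frames around two louder speech frames.
frames = [
    [0.01, 0.01, 0.01, 0.01],
    [0.5, -0.5, 0.5, -0.5],
    [0.01, -0.01, 0.01, -0.01],
    [0.4, -0.6, 0.5, -0.3],
]
speech_frames = drop_non_speech(frames)
```

Only the frames kept by the detector would then be sent to the speech decoder, so long noisy silences no longer contribute to the recognition score.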

More information on the SMADA project can be found on the project’s website:

Reported by

Politecnico di Torino - DAUIN
C. Duca degli Abruzzi, 24
10143 Torino