A physiologically based computational model of the processing of sounds by the cochlear nerve and cochlear nucleus has been developed. The first stage is a series of filters based on simulated reverse correlation functions as measured in the cochlear nerve. The output of these filters is then passed through a stage that applies a human audiogram and an absolute hearing threshold across the array of channels. The resulting array of signals is processed by a model of hair cell transduction, and spikes are finally generated by a probabilistic spike generator. The resulting spike trains are processed by tonotopically organized arrays of models of cochlear nucleus neurones. The recognition scores obtained with the auditory model recognizer showed the same trends as those found in the human perception experiments.
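The last two peripheral stages described above (hair cell transduction followed by probabilistic spike generation) can be illustrated with a minimal sketch. It is not the model used in this work: half-wave rectification is assumed here as a crude stand-in for the hair cell stage, and the spike generator is assumed to be a simple Bernoulli process whose per-step probability is rate times step size. The function names and the parameters `max_rate_hz`, `dt` and `seed` are hypothetical.

```python
import random


def half_wave_rectify(signal):
    """Crude stand-in for hair cell transduction (assumption):
    keep only the positive deflections of one channel's signal."""
    return [max(0.0, x) for x in signal]


def generate_spikes(drive, max_rate_hz=200.0, dt=0.001, seed=0):
    """Bernoulli spike generator (assumption, not the authors' generator).

    The drive is normalised by its peak, mapped to a firing rate between
    0 and max_rate_hz, and a spike is emitted in each time step with
    probability rate * dt (capped at 1).
    """
    rng = random.Random(seed)
    peak = max(drive) if drive else 0.0
    spikes = []
    for x in drive:
        rate = max_rate_hz * (x / peak) if peak > 0 else 0.0
        p = min(1.0, rate * dt)
        spikes.append(1 if rng.random() < p else 0)
    return spikes


# One channel of filtered output (made-up values for illustration).
channel = [0.0, 0.4, 0.9, 0.3, -0.5, -0.8, 0.2, 0.7]
spike_train = generate_spikes(half_wave_rectify(channel), max_rate_hz=500.0)
```

Running one such generator per channel gives an array of spike trains that preserves the channel ordering, which is the form of input a tonotopically organized stage would expect.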
An investigation has been made into how certain temporal feature detectors found in the auditory brainstem could be used as a mechanism for reducing high resolution data to a manageable level while minimising the loss of speech information. The results obtained are consistent with existing knowledge concerning the distribution of vowel and plosive information. Mutual information (MI) maps and recognition results indicate that on and off positions locate both of the key sources of information for consonant discrimination in vowel plosive vowel (VPV) context, and that patterns around burst release. Vowel information is both stronger than consonant information and more extensive in time, and is therefore easier to locate. On and off detectors can be used to focus phoneme recognition on the temporal intervals in the outputs from models of the peripheral auditory system that contain the greatest concentration of phonetic information.
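As a rough illustration of how on and off positions can reduce the data passed to a recognizer, the sketch below marks frames where a channel envelope rises or falls sharply and keeps only short windows around them. This is an assumption-laden simplification: a thresholded frame-to-frame difference stands in for the brainstem detectors studied here, and the names `on_off_positions`, `windows_around`, `threshold` and `half_width` are hypothetical.

```python
def on_off_positions(envelope, threshold):
    """Return (onsets, offsets): frame indices where the envelope rises
    or falls by at least `threshold` relative to the previous frame."""
    onsets, offsets = [], []
    for i in range(1, len(envelope)):
        delta = envelope[i] - envelope[i - 1]
        if delta >= threshold:
            onsets.append(i)
        elif delta <= -threshold:
            offsets.append(i)
    return onsets, offsets


def windows_around(positions, half_width, length):
    """Half-open frame ranges (start, end) centred on each position,
    clipped to the valid range [0, length)."""
    return [
        (max(0, p - half_width), min(length, p + half_width + 1))
        for p in positions
    ]


# Toy envelope for a vowel-plosive-vowel pattern: vowel, closure, vowel.
envelope = [0.0, 0.0, 1.0, 1.0, 1.0, 0.0, 0.0, 1.0, 1.0]
onsets, offsets = on_off_positions(envelope, threshold=0.5)
focus = windows_around(onsets + offsets, half_width=1, length=len(envelope))
```

Only the frames inside `focus` would then be handed to the recognizer, so the amount of data scales with the number of detected events and not with the duration of the utterance.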
A realistic computational model of the auditory system is to be developed based on contemporary physiological knowledge. This model will be tested as an analyser in an automatic speech recognition system. The performance of the system will be assessed and compared with conventional recognisers and human listeners at different signal-to-noise ratios.
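Comparing recognizers at different signal-to-noise ratios requires test material mixed at a controlled SNR. A minimal sketch of one common way to do this follows, assuming SNR is defined as the ratio of mean speech power to mean noise power in decibels; the helper names `mean_power` and `mix_at_snr` are hypothetical and the sample values are made up.

```python
import math


def mean_power(x):
    """Mean squared amplitude of a sequence of samples."""
    return sum(v * v for v in x) / len(x)


def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech power / noise power equals
    10 ** (snr_db / 10), then add it to `speech` sample by sample."""
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise must have non-zero power")
    gain = math.sqrt(mean_power(speech) / (noise_power * 10 ** (snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


speech = [1.0, -1.0, 1.0, -1.0]
noise = [1.0, 1.0, -1.0, -1.0]
noisy_0db = mix_at_snr(speech, noise, snr_db=0)
noisy_20db = mix_at_snr(speech, noise, snr_db=20)
```

The same noisy signals can then be presented to each recognizer under test, so that differences in score reflect the analyser and not the test material.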