
Final Report Summary - INTERPITCH (Interdisciplinary investigation of Pitch coding in whispered speech)

INTERPITCH – 254618

Final publishable summary report

Introduction

In normal (i.e. phonated) speech, vibration of the vocal folds generates a harmonic signal at the speaker's fundamental frequency (F0). This harmonic signal evokes a clear pitch at the fundamental frequency, which crucially contributes to the communication of tone and intonation, but also to conveying speaker gender, emotion and intention, across many languages. In whispered speech, by contrast, the vocal folds are held fixed, and as a result there is no F0. Although perception of intonation is generally better in phonated than in whispered speech, listeners can still perceive tone differences in whisper despite the absence of F0, the main perceptual cue. The goal of this study was to systematically investigate how intonation is coded in whispered speech. To do so, an interdisciplinary approach was taken, combining psychoacoustics, auditory modelling and phonetics, as these fields contribute complementary perspectives on the processes involved in speech communication.

Method

To answer the research questions of whether and how whispering speakers adapt their output to convey intonation, parallel production and perception data were collected for both normal and whispered speech. We took the case of boundary tones, which signal either an interrogative (i.e. a question) or an affirmative (i.e. a statement).

Speech perception was explored by dividing the speech spectrum into a number of spectral regions, and studying the relative information content and influence of each of those spectral regions on pitch perception. These spectral regions (i.e. their cutoff frequencies) were determined by taking into account the known limits of the auditory system that constrain pitch perception for harmonic signals. Various vowel contexts were included in the perception experiments to explore possible variation in expressing whispered intonation as well as restrictions set by the speech context in which the intonation is expressed.
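As an illustration, dividing the speech spectrum into a set of spectral regions can be sketched with a simple filterbank. The cutoff frequencies below are placeholders loosely based on the frequency ranges discussed in this report, not necessarily the values used in the actual experiments:

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt

def split_into_bands(signal, fs, cutoffs=(1000, 1500, 2000)):
    """Split a speech signal into spectral regions delimited by cutoff
    frequencies (Hz). The default cutoffs are illustrative only."""
    edges = [0, *cutoffs, fs / 2]
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        if lo == 0:                      # lowest region: low-pass
            sos = butter(4, hi, btype="low", fs=fs, output="sos")
        elif hi >= fs / 2:               # highest region: high-pass
            sos = butter(4, lo, btype="high", fs=fs, output="sos")
        else:                            # intermediate region: band-pass
            sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        # zero-phase filtering, so band edges are not smeared in time
        bands.append(sosfiltfilt(sos, signal))
    return bands
```

Each stimulus region can then be presented to listeners in isolation to probe its contribution to pitch perception.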

In addition, we studied which acoustic cues could carry the information relevant to intonation perception. Speech production was explored by passing all materials from the perception experiment through auditory models, i.e. computational models of the peripheral auditory system, followed by acoustic analysis of the signal characteristics.
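The temporal side of such models can be approximated with an autocorrelation-based periodicity check. This is a crude stand-in for the project's actual auditory models, with an assumed F0 search range of 80-400 Hz; it illustrates why noise-excited whisper yields no periodicity pitch:

```python
import numpy as np

def periodicity_strength(frame, fs, fmin=80.0, fmax=400.0):
    """Normalised autocorrelation peak in the F0 search range.
    Phonated vowels yield a clear peak (periodicity pitch);
    whispered, noise-excited vowels do not. fmin/fmax are
    illustrative assumptions, not the project's settings."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    ac = ac / ac[0]                      # normalise by zero-lag energy
    lo, hi = int(fs / fmax), int(fs / fmin)
    return float(ac[lo:hi].max())
```

A 200 Hz sine (a stand-in for a phonated vowel) scores near 1, while white noise (a stand-in for whisper) scores near 0.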

Results

The perceptual tests have shown that:

1. Different frequency regions contain the main cues to pitch percepts in normally phonated versus whispered speech. The lower frequency range, roughly below 1000 Hz, was most informative in normal speech, whereas higher frequencies, roughly above 1500 Hz, were most informative in whispered speech.
2. Vowel quality affects listener performance, and the available auditory cues per frequency region vary between vowels. This may be explained by formant information, but does not seem to be restricted to just the first two formants, as had mainly been concluded in earlier literature (main experiment), and does not seem localized in the second or third formant (control experiment).
3. Listeners abstract from vowel quality, i.e. they can discriminate a high /u/ from a low /i/, which suggests that identifiability of the target vowel contributes to intonation perception in whispered speech.

Auditory modeling and acoustic analysis have shown that:

4. Temporal modeling has excluded the presence of periodicity pitch in whispered speech.
5. Spectral modeling has shown that pitch percepts in whisper are spectral in nature.
6. A number of measures derived from the excitation patterns seem to contribute to intonation production in whispered speech. For instance,
(a) interrogatives have more spectral energy,
(b) interrogatives show higher spectral maxima,
(c) spectral maxima in interrogatives tend to occur at higher frequencies, and
(d) spectral slope in affirmatives is more falling/less rising than in interrogatives.

Dynamic versions of these measures, obtained by comparing them at 20% and 60% into the vowel, generally do not vary with intonation condition. Hence, the production of whispered intonation in French does not seem to depend on the speaker conveying a rising (in interrogatives) versus a falling (in affirmatives) contour, which contrasts with what speakers do in phonated speech, where rising versus falling F0 is deemed characteristic of these boundary tones.
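Simplified acoustic proxies for the static measures named above can be computed from a short-time spectrum. These are plain-FFT stand-ins, and the measure definitions are assumptions for illustration, not the project's exact excitation-pattern computations:

```python
import numpy as np

def spectral_measures(frame, fs):
    """Illustrative proxies for the report's measures: overall spectral
    energy, level and frequency of the spectral maximum, and spectral
    slope (dB per kHz, via a least-squares fit to the log-magnitude
    spectrum). Definitions here are assumptions, not the project's."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    level_db = 20 * np.log10(spec + 1e-12)   # floor avoids log(0)
    slope = np.polyfit(freqs / 1000, level_db, 1)[0]
    peak = int(np.argmax(level_db))
    return {
        "energy": float(np.sum(spec ** 2)),
        "max_level_db": float(level_db[peak]),
        "max_freq_hz": float(freqs[peak]),
        "slope_db_per_khz": float(slope),
    }
```

The dynamic versions would compare such measures between frames taken at 20% and 60% into the vowel.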

How can information in speech production explain the listeners' perception? Listener performance significantly correlated with excitation-pattern-derived measures, such as slope, level of the highest peak against the background, and the difference in centre of gravity between high and low versions of the same vowel. However, correlations were only low to moderate. The finding that listeners can abstract from vowel quality suggests that identifiability of the target vowel contributes to intonation perception in whispered speech. Vowel formants, i.e. the resonance frequencies associated with particular vowels, may help explain the performance differences between vowels and between frequency regions, as vowels vary in formant locations and frequency regions vary in formant availability. The good performance in the region above 1500-2000 Hz suggests that this information goes beyond the first and second formants, which are typically thought to identify vowels.
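The centre-of-gravity measure mentioned above can be sketched as the amplitude-weighted mean frequency of the spectrum, again computed here from a plain FFT as a simplification of the excitation-pattern version:

```python
import numpy as np

def centre_of_gravity(frame, fs):
    """Spectral centre of gravity (Hz): the amplitude-weighted mean
    frequency of the magnitude spectrum. A simplified stand-in for
    the excitation-pattern-based measure used in the project."""
    spec = np.abs(np.fft.rfft(frame * np.hanning(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), 1 / fs)
    return float(np.sum(freqs * spec) / np.sum(spec))
```

Per-stimulus differences in such measures between high and low versions of a vowel could then be related to listener scores with an ordinary correlation, e.g. `np.corrcoef`.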

Contribution

This research contributes to

(a) hearing sciences, through a better understanding of the cues that may code pitch in higher frequencies of the speech spectrum, which may lead to new insights for coding pitch in speech communication for hearing-impaired listeners, such as cochlear implant (CI) users;
(b) linguistics/phonetics, by showing alternative cues to intonation that seemingly go beyond a vowel's first two formants, thereby contributing to the question of which variation in sound results in which change in (linguistic) meaning;
(c) speech technology, by showing ways of improving the coding of intonation in whispered speech synthesis.

Contact details

prof. dr. C. Lorenzi lorenzi@ens.fr
dr. W. Heeren willemijn.heeren@ens.fr / w.f.l.heeren@hum.leidenuniv.nl

Équipe Audition

École Normale Supérieure
Paris Sciences et Lettres, Paris, France