Production and perception of prosodic cues to speech segmentation: multisensorial aspects

Final Activity Report Summary - SPEECHSEG (Production and perception of prosodic cues to speech segmentation: multisensorial aspects)

Among the questions addressed was how people engaged in a conversation know where words begin and end. Speech is continuous; it does not have the convenient spaces that separate words in written language. Nevertheless, we generally have no problem in finding the words in our native language.

Listeners use many cues in the non-trivial task of speech segmentation, including a variety of prosodic cues and other language-specific patterns. For example, if an English-speaking listener hears 'mn', she knows that there must be a word break, since English words cannot begin with 'mn'. French listeners use the intonation, or distinctive pitch patterns, of their language as cues to segmentation. The presence of a rise in fundamental frequency (F0), the rate of vocal fold vibration and the primary acoustic correlate of pitch, helps them find content words such as nouns and verbs. Also important is tonal alignment: the temporal relationship between the high and low points (tones) in the F0 pattern and units on the segmental level, such as consonants, vowels and syllables.
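As a rough illustration (not part of the project's methods), an F0 rise of the kind described can be measured by estimating F0 from the waveform and comparing successive measurement points. The sketch below uses a simple autocorrelation-based F0 estimator on a synthetic signal; the function name, thresholds and signal are assumptions for demonstration only.

```python
# Minimal sketch: estimating F0 (rate of vocal fold vibration) by
# autocorrelation and flagging a rise between two measurement points.
# All names, thresholds and the synthetic signal are illustrative
# assumptions, not the project's actual analysis pipeline.
import numpy as np

def estimate_f0(frame, sample_rate, f0_min=75.0, f0_max=400.0):
    """Estimate F0 (Hz) of one frame from its autocorrelation peak."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lag_min = int(sample_rate / f0_max)   # shortest period considered
    lag_max = int(sample_rate / f0_min)   # longest period considered
    lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sample_rate / lag

# Synthetic 'speech': a 100 Hz segment followed by a 150 Hz segment,
# mimicking an F0 rise toward a content-word beginning.
sr = 16000
t = np.arange(sr // 10) / sr              # 100 ms per segment
f0_start = estimate_f0(np.sin(2 * np.pi * 100 * t), sr)
f0_later = estimate_f0(np.sin(2 * np.pi * 150 * t), sr)

# Flag a 'rise' if the later point is substantially higher (threshold
# chosen arbitrarily here).
is_rise = f0_later > f0_start * 1.2
print(f0_start, f0_later, is_rise)
```

Real speech would of course require framing, voicing detection and smoothing, but the core idea, locating the periodicity of vocal fold vibration and tracking how it changes over time, is the same.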

Dr Welby and her colleagues investigated French tonal alignment, finding consistent patterns not observed in other languages. They also studied tonal alignment from an articulatory point of view, examining the relationship between lip and tongue gestures and F0. In addition, they examined acoustic cues (intonational, formant and duration cues) to speech segmentation, using pairs like l'affiche 'the poster' and la fiche 'the sheet', which contain the same sequence of consonants and vowels. For example, there was sometimes an F0 rise at the start of l'affiche. Listeners distinguished between pairs like these even before the end of the sequence, suggesting that acoustic differences helped them segment speech even at the earliest stages of processing.

Furthermore, they examined whether word boundary cues such as F0 rises were enhanced in noisy environments, in which listeners might have difficulty finding word boundaries. This was important because many conversations take place in some sort of noise, such as children playing, cars passing and wind whistling. Speakers adapt to noise, generally speaking more loudly, more slowly and at a higher pitch, and some of these changes make speech easier to understand. Dr Welby and her colleagues recorded speech in quiet conditions, in white noise and in 'cocktail party' noise, and examined the intonational and articulatory characteristics of the speech. The results showed that some speakers produced more intonational rises in noisy conditions, suggesting that they might have altered their speech to provide their listeners with cues to the beginnings of content words (nouns, verbs, etc.). In addition, the differences in F0 between quiet and noisy conditions observed for French differed from those observed in another study on Dutch, pointing to the importance of taking language-specific differences into account when studying speech in noise (Lombard speech).

Dr Welby and her colleagues used a special lip-tracking system to examine the articulation of speech in noise. They found evidence of hyper-articulation, at least in some conditions, which could help the listener and viewer segment speech. We know that listeners use visual cues as well as auditory cues in other areas of speech perception, and that this helps make speech a robust medium even in difficult speaking conditions. For example, even in a noisy train station, listeners can distinguish the word 'mom' from the word 'Tom' by the closing of the lips at the beginning of the word. The idea that listeners might use similar articulatory or visual cues to help them segment speech seemed plausible, given that researchers had preliminary evidence that some types of intonation patterns had articulatory or visual correlates.

Beyond their contribution to theoretical issues, the results had several potential practical applications, for example in the development of 'smart' speech technologies that detected the presence of noise and adapted to it.