The role of visual cues in speech segmentation and the acquisition of word order: a study of monolingual and bilingual adults and infants

Final Report Summary - VISCUESACQWO (The role of visual cues in speech segmentation and the acquisition of word order: a study of monolingual and bilingual adults and infants)

Summary description of the project objectives
The VisCuesAcqWO project investigates the role of prosody and visual information in speech segmentation and the acquisition of syntax. The project has three main aims:
(1) To determine whether visual facial information accompanies a specific type of prosodic information, namely phrasal prominence, which correlates with the basic word order of natural languages and has therefore been proposed as a cue that could allow prelexical infants to discover this major syntactic feature (Gervain & Werker 2013).
(2) To investigate the role that the visual facial gestures potentially accompanying the prosodic patterns marking word order play in speech segmentation and the acquisition of word order in infancy.
(3) To investigate whether the relative weight of visual facial information changes throughout development by examining its role in adulthood.
These research activities have been complemented with mentoring and dissemination activities, as described in the work plan approved for VisCuesAcqWO.

Description of the project performed and main results and conclusions
Speech is audiovisual: visual information plays an important role in the perception of auditory speech both in infancy and adulthood. However, its potential influence on particular aspects of the auditory speech signal, such as auditory prosody, and its role in language acquisition remain largely unexplored; examining them is the main goal of the present project. A specific type of prosodic information, i.e. the acoustic realization of phrasal prominence, correlates with the relative order of verbs and objects across natural languages. In O(bject)-V(erb) languages, the prominent element typically has higher pitch and/or intensity (Japanese: high-low, [‘To]kyo ni), whereas in V(erb)-O(bject) languages the prominent element is typically lengthened (English: short-long, to [Ro]me). Phrasal prominence might thus help infants learn the basic word order of their native language(s). Indeed, seven-month-old infants can use phrasal prominence to segment unknown artificial languages (Gervain & Werker 2013). VisCuesAcqWO examines whether co-verbal visual facial gestures, such as eyebrow movements and head nods, are available to prelexical infants and whether, together with auditory prosody (i.e. phrasal prominence), they help infants locate phrase boundaries and discover basic word order.
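The high-low versus short-long patterns described above can be pictured as two opposite grouping rules over the same alternating stream. The sketch below is a minimal illustration (not the experimental materials; the syllable names are invented): pitch prominence marks the first element of a unit, duration prominence marks the last.

```python
# Toy illustration of the two prominence-based groupings: the same
# alternating syllable stream supports OV-like [STRONG weak] units when
# prominence is realized as pitch, and VO-like [weak STRONG] units when
# it is realized as duration. Syllable names are invented.

def group(stream, prominence):
    """Pair up syllables: pitch prominence starts a unit (strong-weak),
    duration prominence ends one (weak-strong)."""
    if prominence == "pitch":        # prominent element initial: [HIGH low]
        return [(stream[i], stream[i + 1])
                for i in range(0, len(stream) - 1, 2)]
    elif prominence == "duration":   # prominent element final: [short LONG]
        return [(stream[i], stream[i + 1])
                for i in range(1, len(stream) - 1, 2)]
    raise ValueError(prominence)

syllables = ["ka", "mu", "ka", "mu", "ka", "mu"]
print(group(syllables, "pitch"))     # [('ka', 'mu'), ('ka', 'mu'), ('ka', 'mu')]
print(group(syllables, "duration"))  # [('mu', 'ka'), ('mu', 'ka')]
```

The point of the toy is that the segmental content is identical in both cases; only the acoustic realization of prominence decides which grouping a listener recovers.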
To that end, a production study was first conducted with adult native talkers of English and Japanese, two languages with opposite word orders. Participants were videotaped producing target utterances in both Adult- and Infant-Directed Speech (ADS and IDS), and their acoustic and facial gesture information — eyebrow movements and head nods — was measured. Manual annotation combined with Optical Flow Analysis (Barbosa et al., 2008) of eyebrow and head movements revealed the presence of audio-visual information signaling phrase boundaries across languages and speech styles. Specifically, the starts and peaks of eyebrow movements occurred significantly more often on the first element of the target phrases, whereas the ends of the peaks and the ends of the eyebrow movements occurred more frequently on their second element, often in combination with head nods. Interestingly, these visual gestures were more reliable, more pronounced, or more frequent in IDS than in ADS, that is, in the speech directed to learners undergoing acquisition.
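The start/peak/end landmarks of a gesture can be pictured on a one-dimensional motion trace. The following is a hedged sketch, not the Barbosa et al. toolchain: given an eyebrow-height trace over time (values invented), it marks the movement's start and end as threshold crossings on either side of the maximum.

```python
# Simplified landmark detection on a 1-D gesture trace (illustrative
# only; not the Optical Flow Analysis pipeline used in the project).

def movement_landmarks(trace, threshold=0.2):
    """Return (start, peak, end) sample indices of a single movement:
    the peak is the global maximum; start/end are the first samples on
    each side whose neighbors fall at or below the threshold."""
    peak = max(range(len(trace)), key=lambda i: trace[i])
    start = peak
    while start > 0 and trace[start - 1] > threshold:
        start -= 1
    end = peak
    while end < len(trace) - 1 and trace[end + 1] > threshold:
        end += 1
    return start, peak, end

trace = [0.0, 0.1, 0.4, 0.9, 1.0, 0.7, 0.3, 0.1, 0.0]  # invented samples
print(movement_landmarks(trace))  # (2, 4, 6)
```

In the production study, it is the timing of such landmarks relative to the first versus second element of the phrase that carries the boundary information.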
This available multimodal information might thus allow prelexical infants to attune to the basic word order of their language(s). However, whether infants are sensitive to and make use of these visual facial gestures remained to be determined. Therefore, in two series of experiments, we examined whether the presence of visual facial information modulated or determined the segmentation preferences for an ambiguous artificial language in (i) prelexical infants and (ii) monolingual and bilingual adults.
We created a series of ambiguous languages which contained the prosodic cues associated with word order, i.e. changes in pitch or duration, in addition to visual information — specifically head nods — displayed by means of a computer-generated avatar of a face. Crucially, we presented infants with either (i) aligned visual and prosodic information: the head nods peaked at the prosodically prominent — long or higher-pitched — syllables, or (ii) misaligned visual and prosodic information: the head nods peaked at the prosodically non-prominent — short or lower-pitched — syllables. We first exposed the infants to a familiarization stream of one of the structurally ambiguous artificial languages, which had two possible segmentations. Infants were then presented with test trials containing phrases in these two possible segmentations, and we measured their looking behavior using an eye-tracking technique. Preliminary analysis revealed no segmentation preference for the artificial languages at test, which might result from overly complex visual stimuli, due to the unnaturalness of the avatar and its potentially non-biological motion. We are currently conducting a combined Optical Flow and off-line frame-by-frame coding of the videos of the infant participants, which, together with the eye-tracking analysis, will allow us to gain a full picture of the infants’ behavior during familiarization and test, and to observe potential patterns in their scanning of the avatar’s face. Last, the patterns of brain activity of 7-month-old infants were measured during presentation of an auditory-only artificial language, using Near-Infrared Spectroscopy (NIRS), which revealed a habituation response to the language during familiarization, manifested as an inverted neural response (of the oxygenated and deoxygenated hemoglobin signals).
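The alignment manipulation amounts to deciding which syllables receive a nod peak. As a minimal sketch (syllable names, timings, and prominence values are invented, not the project's stimuli):

```python
# Toy version of the aligned/misaligned stimulus design: nod peaks are
# placed on prosodically prominent syllables (aligned) or on the
# non-prominent ones (misaligned). All values are invented.

syllables = [("ka", 0.0, True),   # (name, onset in seconds, prominent?)
             ("mu", 0.3, False),
             ("ka", 0.6, True),
             ("mu", 0.9, False)]

def nod_peaks(syls, aligned=True):
    """Onsets at which the avatar's head nod should peak."""
    return [onset for _, onset, prominent in syls if prominent == aligned]

print(nod_peaks(syllables, aligned=True))   # [0.0, 0.6]
print(nod_peaks(syllables, aligned=False))  # [0.3, 0.9]
```

In the aligned condition the visual cue is redundant with the prosodic one; in the misaligned condition the two cues point to opposite segmentations, which is what lets the design separate their contributions.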
In addition, we examined the role of visual information and its interplay with other available sources of information in speech segmentation in adulthood. We created seven artificial languages containing different combinations of statistical and/or prosodic and/or visual cues (head nods), aligned or misaligned. Seven groups of bilinguals and five groups of English monolinguals were familiarized with one of the seven structurally ambiguous artificial languages, followed by a test phase in which they chose between pairs of sequences with opposite segmentations. The results showed that adult monolinguals and bilinguals use statistical and prosodic information to segment unknown languages, but their use of visual information is more limited, suggesting that visual cues are weighted lower in the hierarchy of available segmentation cues when listeners are presented with intact speech.
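The statistical cue mentioned above is standardly operationalized as the forward transitional probability (TP) between adjacent syllables, with boundaries posited where TP dips. A minimal sketch under that assumption (the three "words" and the threshold are invented for illustration, not the project's languages):

```python
# Toy transitional-probability segmentation: TP(a -> b) = P(b | a) is
# high within words and low across word boundaries, so boundaries are
# posited where TP falls below a threshold. Stream is invented.
from collections import Counter

stream = "tupiro golabu bidaku tupiro bidaku golabu tupiro".replace(" ", "")
syls = [stream[i:i + 2] for i in range(0, len(stream), 2)]

pair_counts = Counter(zip(syls, syls[1:]))
first_counts = Counter(syls[:-1])

def tp(a, b):
    """Forward transitional probability P(b | a)."""
    return pair_counts[(a, b)] / first_counts[a]

words, current = [], [syls[0]]
for a, b in zip(syls, syls[1:]):
    if tp(a, b) < 0.75:  # arbitrary threshold for this toy stream
        words.append("".join(current))
        current = []
    current.append(b)
words.append("".join(current))
print(words)  # recovers the word tokens tupiro / golabu / bidaku
```

Here within-word TPs are 1.0 and across-word TPs are 0.5, so the dips recover the word boundaries; in the experiments, this statistical cue could then converge with, or conflict with, the prosodic and visual cues.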
This investigation will advance our understanding of early linguistic development and the role of visual information in the cognitive mechanisms involved in speech segmentation and the acquisition of syntax. Examining infants of different age groups in addition to adult participants, not only monolingual but also bilingual, allows us to better understand the developmental trajectory and potential changes in the relative weight given to visual facial gestures by these different populations. Further, establishing the mechanisms that enable such early discovery of the syntax of the native language would in turn allow us to uncover important milestones in typical development. Such redundant multimodal cues might be of particular importance in some language situations, e.g. for bilingual infants growing up with two languages that have particularly different grammars, such as Japanese and English, or Spanish and Basque. A better understanding of early linguistic development, particularly in a bilingual context, has far-reaching societal implications in today’s multilingual and multicultural societies, e.g. for education.

Barbosa, A.V., Yehia, H.C., & Vatikiotis-Bateson, E. (2008). Linguistically valid movement behavior measured non-invasively. In R. Göcke, P. Lucey & S. Lucey (Eds.), Proceedings of the International Conference on Auditory-Visual Speech Processing (pp. 173–177). Moreton Island, Australia: Causal Productions.
Gervain, J., & Werker, J.F. (2013). Prosody cues word order in 7-month-old bilingual infants. Nature Communications, 4, 1490.