Community Research and Development Information Service - CORDIS



Project ID: 624972
Funded under: FP7-PEOPLE
Country: France

Periodic Report Summary 1 - VISCUESACQWO (The role of visual cues in speech segmentation and the acquisition of word order: a study of monolingual and bilingual adults and infants)

• Summary description of the project objectives
The VisCuesAcqWO project investigates the role of prosody and visual information in speech segmentation and the acquisition of syntax. The project has three main aims:
(1) To determine whether visual facial information accompanies a specific type of prosodic information, i.e., phrasal prominence, that correlates with the basic word order of natural languages and has thus been proposed as a cue that could allow prelexical infants to discover this major syntactic feature (Gervain & Werker 2013).
(2) To investigate the role of the potential visual facial gestures accompanying the prosodic patterns that mark word order in speech segmentation and the acquisition of word order in infancy.
(3) To investigate whether the relative weight of visual facial information changes throughout development by examining its role in adulthood.
These research activities have been complemented with mentoring and dissemination activities, as described in the work plan approved for VisCuesAcqWO.

• Description of the work performed since the beginning of the project and main results
Speech is audiovisual from early in infancy. Visual information has been shown to play an important role in the perception of auditory speech in both infancy and adulthood. However, its potential influence on particular aspects of the auditory speech signal, such as auditory prosody, and its role in language acquisition remain as yet unexplored; investigating them is the main goal of the present project. A specific type of prosodic information, i.e., the acoustic realization of phrasal prominence, correlates with the relative order of verbs and objects in natural languages. Prominence is realized as a pitch/intensity contrast in O(bject)-V(erb) languages, where the prominent element is higher in pitch and/or intensity (Japanese: high-low, [‘To]kyo ni), and as a durational contrast in V(erb)-O(bject) languages, where the prominent element is lengthened (English: short-long, to [Ro]me). It has been argued that phrasal prominence might thus help infants learn the basic word order of their native language(s). Indeed, seven-month-old infants can use phrasal prominence to segment unknown artificial languages (Gervain & Werker 2013). VisCuesAcqWO seeks to examine whether co-verbal visual facial gestures, such as eyebrow movements and head nods, are available to prelexical infants and, together with auditory prosody (i.e., phrasal prominence), help them locate the boundaries of phrases and discover basic word order.
To that end, a production study was first conducted with adult native speakers of languages with opposite word orders, namely 7 English (VO) and 8 Japanese (OV) speakers. Participants were videotaped producing target phrases (e.g., English: behind curtains, Japanese: kabuto made) embedded in an invariant carrier sentence in their respective languages, in both Adult Directed Speech (ADS) and Infant Directed Speech (IDS). The facial gesture information present in the target phrases (specifically eyebrow movements and head nods) was measured in addition to their acoustic information, to determine whether a correlation exists between the prosodic cues associated with word order differences (changes in the pitch, intensity and duration of the speech signal) and the potentially available visual information. The acoustic analysis revealed the expected contrast in pitch in Japanese IDS and a similar trend in ADS. In English, the predicted durational contrast was found in ADS, accompanied by a pitch contrast. Interestingly, a pitch contrast but no durational contrast was found in IDS. The presence of exaggerated pitch contours is indeed one of the most characteristic features of English IDS.
Head motion was submitted to Optical Flow (OF) analysis. OF is a technique that computes time-varying changes in the direction (vertical and horizontal) and magnitude of motion in a specified region of interest (here, the speaker’s head) by comparing changes in pixel intensity across consecutive frames of a video. Greater vertical head motion in the phrase’s prominent element (e.g., curtains) was found in English IDS. A similar trend was found in ADS, in addition to significantly greater horizontal motion and magnitude of motion in the non-prominent element (e.g., behind). In Japanese, greater motion in every direction was found in the non-prominent element (e.g., made) in both IDS and ADS.
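As an illustration of the general principle behind OF, the following is a minimal sketch in Python, not the analysis pipeline used in the project (the report does not specify the OF algorithm or software). It assumes a single global displacement for the whole region of interest and estimates it with a Lucas-Kanade-style least-squares fit on pixel-intensity changes between two consecutive grayscale frames. The names global_flow and gaussian_frame, and the synthetic test frames, are hypothetical and introduced only for this example.

```python
import math

def global_flow(prev, curr, roi):
    """Estimate one (u, v) displacement, in pixels, for a region of interest.

    prev, curr: consecutive grayscale frames as 2-D lists of intensities.
    roi: (top, bottom, left, right), bottom/right exclusive; it must leave
         a one-pixel border inside the frame for the gradient computation.
    Returns (u, v, magnitude); u is horizontal motion (positive = rightward)
    and v is vertical motion (positive = downward in image coordinates).
    """
    top, bottom, left, right = roi
    sxx = sxy = syy = sxt = syt = 0.0
    for y in range(top, bottom):
        for x in range(left, right):
            # Spatial intensity gradients (central differences).
            ix = (prev[y][x + 1] - prev[y][x - 1]) / 2.0
            iy = (prev[y + 1][x] - prev[y - 1][x]) / 2.0
            # Temporal intensity change between the two frames.
            it = curr[y][x] - prev[y][x]
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
            sxt += ix * it
            syt += iy * it
    # Solve the 2x2 least-squares system of the brightness-constancy
    # constraint ix*u + iy*v + it = 0 summed over the region.
    det = sxx * syy - sxy * sxy
    if abs(det) < 1e-12:
        return 0.0, 0.0, 0.0  # no usable texture in the region
    u = (-syy * sxt + sxy * syt) / det
    v = (sxy * sxt - sxx * syt) / det
    return u, v, math.hypot(u, v)

def gaussian_frame(height, width, cx, cy, sigma=3.0):
    """Hypothetical test image: a smooth bright blob centred at (cx, cy)."""
    return [[math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))
             for x in range(width)] for y in range(height)]

# Usage: the blob moves half a pixel downward between the two frames.
frame_a = gaussian_frame(21, 21, 10.0, 10.0)
frame_b = gaussian_frame(21, 21, 10.0, 10.5)
u, v, magnitude = global_flow(frame_a, frame_b, (1, 20, 1, 20))
print(f"horizontal={u:.3f} vertical={v:.3f} magnitude={magnitude:.3f}")
```

In this sketch, vertical and horizontal motion correspond to the v and u components and overall motion to the magnitude; applying the function to each pair of consecutive frames would give time-varying motion estimates. Practical analyses typically use dense, per-pixel flow estimates from an established library rather than a single global fit.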
A confirmatory analysis consisting of visual inspection of around 40% of the productions revealed that only the greater vertical motion found in the prominent element in English seemed to derive from the presence of head nods. Manual annotation of eyebrow movements revealed that the starts and peaks of movements occurred significantly more often in the first element of the target phrase in both Japanese and English, and more so in IDS than in ADS, whereas ends of peaks and ends of movements occurred more frequently in the second element of the phrase in English and in an analysis of all productions collapsed.
These results thus suggest functional and cross-linguistic differences in the use of eyebrow movements and head nods. Eyebrow movements appear to signal the boundaries of the target phrase, whereas head nods appear to signal the phrase’s prominent element, though only in English. Further, eyebrow movements and head nods seem to be more frequent or larger in IDS than in ADS. In sum, combining the available visual and prosodic information might help infants (i) locate the boundaries of phrases, which are marked by eyebrow movements, and (ii) detect the prominent element within the phrase, which is marked by prosodic information and possibly head nods, as well as the elements at the edges of phrases.
This available multimodal information might thus allow prelexical infants to attune to the basic word order of their language(s). However, whether infants are actually sensitive to and make use of these visual facial gestures remains to be determined. Therefore, in three series of currently ongoing experiments, we examine whether the presence of visual facial information modulates or determines how the following groups segment an ambiguous artificial language: (i) 4- and 8-month-old monolingual infants, (ii) 8-month-old English-OV bilinguals, and (iii) adult monolinguals and bilinguals. The ambiguous languages contain the prosodic cues associated with word order, i.e., changes in pitch or duration, in addition to visual information (specifically head nods) displayed by means of a computer-generated avatar of a face. Crucially, participants (infants and adults) are presented with either (i) aligned visual and prosodic information, in which the head nods peak at the prosodically prominent (long or higher-pitched) syllables, or (ii) misaligned visual and prosodic information, in which the head nods peak at the prosodically non-prominent (short or lower-pitched) syllables. These sets of studies will allow us to determine whether infants and adults can use visual facial gestures to segment unknown languages, and to identify potential changes in the weight given to visual and prosodic information: (i) throughout development, (ii) when the two are presented in conflict, and (iii) between monolinguals and bilinguals.

Gervain, J., & Werker, J.F. (2013). Prosody cues word order in 7-month-old bilingual infants. Nature Communications, 4, 1490.

• Expected final results and potential impact and use
This investigation will advance our understanding of early linguistic development and the role of visual information in the cognitive mechanisms involved in speech segmentation and the acquisition of syntax. Examining infants of two different age groups in addition to adult participants, not only monolingual but also bilingual, will allow us to better understand the developmental trajectory and potential changes in the relative weight given to visual facial gestures by these different populations. Further, establishing the mechanisms that enable such early discovery of the syntax of the native language would in turn allow us to uncover important milestones in typical development. Such redundant multimodal cues might be of particular importance in some language situations, e.g., for bilingual infants growing up with two languages that have particularly different grammars, such as Japanese and English, or Spanish and Basque. Last, the research conducted in the present project will allow us to examine novel languages typically understudied in the literature. This in turn would help us clarify the relative contributions of universal vs. language-specific mechanisms in development. A better understanding of early linguistic development, particularly in a bilingual context, has far-reaching societal implications in today’s multilingual and multicultural societies, e.g., for education.


Lucie Guilloteau (European Affairs Responsible)
Tel.: +33 1 76 53 20 33


Life Sciences
Record Number: 188992 / Last updated on: 2016-09-19