
Ecological Language: A multimodal approach to language and the brain

Periodic Reporting for period 2 - ECOLANG (Ecological Language: A multimodal approach to language and the brain)

Reporting period: 2019-07-01 to 2020-12-31

In everyday settings, language is learnt and mostly used in face-to-face contexts, in which multiple cues contribute to comprehension. These cues include what is being said, the intonation of the voice, the face (especially mouth movements), and the hand gestures that are time-locked and related to what is being said, such as pointing to an object while talking about it, or gestures that evoke imagery of what is being said. Yet our knowledge of the psychological and neural mechanisms underpinning how language is learnt and used comes almost exclusively from studies focusing on speech or text, in which this rich multimodal context is not taken into account.

ECOLANG studies language learning and comprehension in the real world, investigating language as the rich set of cues available in face-to-face communication. We ask whether and how children and adults use multimodal cues to learn new vocabulary and to process known words. We further ask how the brain integrates the different types of multimodal information during language learning and comprehension.

Using a real-world approach to language learning and comprehension provides key novel insights that can enhance treatments for developmental and acquired language disorders. It also provides novel constraints for automatic language processing, leading to improved performance by automatic systems in learning and processing language and in interacting with humans.
We have nearly completed the collection of the ECOLANG corpus. As far as we are aware, this is the first corpus of multimodal communication between two individuals that allows us to assess how social interaction supports effective communication as well as the learning of new concepts. We have created new materials and procedures for eliciting communication about novel objects in both adult-child and adult-adult dyads. The use of the same protocol for both types of dyad will further allow us to directly compare child-directed and adult-directed language. Analyses of the adult-child corpus data show that caregivers use multimodal cues such as gesture or prosodic modulation differently depending upon the age of the child; whether the child knows the object being talked about; and whether the object is in view (Kewenig, Brieke, Gu & Vigliocco, 2020; Motamedi et al., 2019; Shi, Gu, Grzyb & Vigliocco, 2020; Vigliocco et al., 2019). Crucially, the way in which the cues co-occur also changes depending upon whether the object is novel (Chen et al., 2020). We also found that caregivers' use of certain cues is correlated with children's immediate learning of words and with their vocabulary size. For instance, a caregiver's greater adjustment in speaking rate and pitch between known and unknown words predicts better immediate learning (Shi et al., 2020), and the amount of caregiver gaze to novel objects is related to children's vocabulary size (Grzyb, Cheng & Vigliocco, in prep). Thus, the annotated corpus provides us with key information about when speakers use multimodal cues in their communication and how these cues are orchestrated. Crucially, we have shown that: (1) the cues are produced when they are most useful, and therefore cannot be dismissed as mere embellishments to speech; (2) the presence of these cues predicts learning in children.
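To make the logic of this kind of result concrete, the sketch below computes a hypothetical caregiver "adjustment" score (how much the caregiver slows down for unknown relative to known words) and correlates it with children's immediate word learning. All names and data here are invented for illustration; the actual analyses in Shi et al. (2020) use the annotated corpus and richer statistical models.

```python
import math
from statistics import mean

def rate_adjustment(rates_known, rates_unknown):
    """Adjustment score: how much slower (e.g. syllables/sec) the caregiver
    speaks for unknown words compared with known words."""
    return mean(rates_known) - mean(rates_unknown)

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Invented data: one (known-word, unknown-word) speaking-rate profile per
# caregiver, and the number of new words their child learned in the session.
caregivers = [
    ([4.1, 4.3], [4.0, 4.2]),   # small slow-down for unknown words
    ([4.2, 4.4], [3.6, 3.8]),
    ([4.0, 4.6], [3.2, 3.4]),   # large slow-down for unknown words
]
words_learned = [1, 3, 5]

adjustments = [rate_adjustment(k, u) for k, u in caregivers]
r = pearson_r(adjustments, words_learned)
# With these invented numbers, r is strongly positive: caregivers who adjust
# more have children who learn more words.
```

The same template extends directly to other cue adjustments (pitch range, gesture rate, gaze to the object) against learning outcomes.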

We have begun assessing the interaction between linguistic predictability (measured as linguistic surprisal based on n-gram or RNN models) and multimodal cues such as prosodic modulation and the presence of gestures. We asked whether word surprisal predicts whether speakers produce a prosodic modulation (we focused on word duration), expecting longer durations for words that are less predictable from the preceding context. We also looked at representational gestures (gestures that imagistically refer to what is being talked about), expecting speakers to produce more gestures for more surprising words. This is precisely what we observed (Grzyb, Vigliocco & Frank, in prep). Overall, this work indicates the potential of computational models for assessing the interdependence between the use of specific words and multimodal cues such as gestures or prosodic modulation.
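Surprisal is simply the negative log-probability of a word given its preceding context. As a minimal sketch (a toy Laplace-smoothed bigram model over an invented fragment of child-directed speech, not the large-corpus n-gram/RNN models used in the actual analyses):

```python
import math
from collections import Counter

def bigram_surprisal(tokens, context, word):
    """Surprisal -log2 P(word | context) under a Laplace-smoothed bigram model.

    Higher values mean the word is less predictable from the preceding word.
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    unigrams = Counter(tokens)
    vocab_size = len(unigrams)
    p = (bigrams[(context, word)] + 1) / (unigrams[context] + vocab_size)
    return -math.log2(p)

# Toy corpus, invented for illustration.
tokens = "look at the ball the ball is red look at the cup".split()

predictable = bigram_surprisal(tokens, "the", "ball")  # frequent bigram
surprising = bigram_surprisal(tokens, "the", "cup")    # rare bigram
# "cup" is less predictable after "the" than "ball" is, so it gets higher
# surprisal; the hypothesis above is that such words attract longer
# durations and more representational gestures.
```

In the project's analyses, per-word surprisal values like these are then regressed against annotated word durations and gesture occurrence in the corpus.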

In behavioural studies, we developed quantitative measures of the informativeness of gestures and, especially, of mouth movements, in order to assess the impact of these multimodal cues and their interactions on word recognition. We have found that gesture informativeness and mouth movements speed up word recognition (Krason, Fenton & Vigliocco, in prep). In electrophysiological work using naturalistic stimuli, we have investigated whether the presence of multimodal cues (and their combination) modulates a biomarker of processing difficulty (the N400). We have found that all the multimodal cues we investigated (representational gestures, beats, prosodic stress and mouth movements) affect the processing of words for both first- and second-language users, indicating that these cues are central to language processing. We also found that their impact changes dynamically depending upon their informativeness and, finally, that there is a hierarchy: prosody shows the strongest effect, followed by gestures and mouth movements (Zhang, Frassinelli, Tuomainen & Vigliocco, 2020). Thus, these studies provide a first snapshot of how the brain dynamically weights audiovisual cues in language comprehension.
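The project's own informativeness measures are described in the cited work; purely as an illustrative sketch, one simple way to quantify how informative a visual cue is treats it as the reduction in uncertainty (entropy, in bits) over candidate words once the cue is seen. The cue and candidate words below are invented:

```python
import math

def entropy(counts):
    """Shannon entropy (bits) of a distribution given as raw counts."""
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def cue_informativeness(candidates_before, candidates_after):
    """Bits of uncertainty about the word removed by seeing the cue."""
    return entropy(candidates_before) - entropy(candidates_after)

# Invented example: from degraded audio alone, four candidate words are
# equally likely; a visible bilabial closure on the lips is compatible with
# only one of them.
before = {"ball": 1, "doll": 1, "tall": 1, "call": 1}
after = {"ball": 1}   # only "ball" starts with a bilabial closure

gain = cue_informativeness(before, after)  # 2.0 - 0.0 = 2.0 bits
```

A cue that removes more bits of uncertainty counts as more informative, matching the intuition that highly informative mouth movements or gestures should facilitate word recognition most.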
ECOLANG pioneers a new way to empirically study spoken language. We study real-world language as a multimodal phenomenon: the input comprises linguistic information, but also prosody, mouth movements, eye gaze and gestures. This contrasts with the current mainstream reductionist approach in which language is most often reduced to speech or text. By bringing in the multimodal context, we blur the traditional distinction between language and communication currently present in linguistics, psychology of language, and neurobiology of language. Crucially, we also study language in real-world settings in which interlocutors are adults or children and learning new words is intertwined with processing known words. This contrasts with current approaches in which language processing is studied in adults and language learning is studied in children.

Our starting point is the development of a corpus of dyadic communication between an adult and a child, or between two adults. To our knowledge, this will be the first naturalistic annotated corpus comprising both adult-to-adult and adult-to-child conversations that manipulates key aspects of the context. These manipulations are expected to bring about differently weighted combinations of the cues: for example, will visible cues (gesture, mouth movements), in addition to prosody, be more prominent in child-directed than in adult-directed language? Is gesture different when objects are present (pointing to the objects) versus when they are absent (gestures iconic of referent properties)? We will also develop initial computational models of how multimodal cues are combined in spoken language, tested against the results of behavioural and electrophysiological studies investigating whether and how the multimodal cues affect language learning and processing. Finally, we expect to obtain some of the very first evidence concerning how neural networks are orchestrated in multimodal language, combining fMRI and patient studies.
[Figure: example of a typical interaction from our corpus]