
Integrating Object Recognition and ActiOn for action sentence production and comprehension in a developmental humanoid Robot

Final Report Summary - ORATOR (Integrating Object Recognition and ActiOn for action sentence production and comprehension in a developmental humanoid Robot)

Recent models in cognitive robotics research have begun to appreciate the entwined relationship between language and action, and have proposed to ground robots’ language understanding in sensorimotor representations. These models, however, have focused mainly on a one-way relation between language and action, neglecting its bi-directional character. The primary aim of this research was therefore to provide an experimental and computational framework for explaining how language processes interact with other processes, such as motor control. We tackled this issue from the developmental robotics perspective, focusing on the phenomenon of scale errors, i.e. children’s serious attempts to perform impossible actions on miniature objects. Two behavioural studies were performed.
The goal of the first study was to examine whether children’s attempts to perform impossible actions are aligned with their perception of possible and impossible actions. We first tested children in a scale error elicitation phase to identify those producing scale errors. We then tested them in a computer-based eye-tracking phase, in which they were familiarised with an animation of a possible action, followed by either an impossible or a possible action. We targeted two groups of children, aged 18 to 25 months (N=52) and 48 to 60 months (N=23). The results showed that older children, and younger children who did not make any scale errors, were able to reliably distinguish possible from impossible actions. For children who made scale errors in the elicitation phase, the inconsistency of their actions translated into perceptual inconsistency: they were unable to reliably distinguish possible from impossible actions.
The second study examined the nature of scale errors and their possible relation to language development: does a child with a larger vocabulary produce more scale errors than a child with a smaller vocabulary? And if so, which types of words are particularly developed in children who make scale errors? In this study we replicated the scale error elicitation situation and collected from each parent a representative estimate of the child’s vocabulary in comprehension and production (the Oxford Communicative Development Inventory, OCDI). We collected a large sample of data (N = 125; age M = 23.50 months, range 18–30 months). We focused on three main factors (gender, age and vocabulary size) to explore the source of potential individual differences in the production of scale errors. We found no gender differences. Based on previous results, we expected the incidence of scale errors to peak at 24 months and to decline, or even disappear, as children grow older. In our study, the incidence of scale errors peaked at 18 months and decreased linearly with age. We also asked whether the size and structure of children’s vocabulary is related to the number of scale errors they make. We looked at individual children’s vocabularies and examined how subsets of known and produced words can influence children’s action selection. This examination showed that early talkers (children who scored above the 75th percentile for their age on the OCDI) are more likely than late talkers (children who scored below the 25th percentile for their age) to make scale errors.
Scale errors have been hypothesised to result from immaturity in the interaction between two systems, namely the visual stream for action and the visual stream for perception, coupled with a lack of inhibitory control in children. The results of our study provide evidence that this dissociation in the action-perception system is influenced by developing language skills, as children in a particular period of their language development are more prone to scale errors.
The behavioural studies provided further evidence for the hypothesis that perception, action and language develop in parallel and influence each other. The hypotheses derived from the behavioural studies were then replicated in computational experiments. To this end, we designed a multimodal cognitive architecture based on stacked Restricted Boltzmann Machines (RBMs) that learns to associate an object’s visual features, the action associated with the object, and its name. The top-most associative layer integrated the inputs from three individual RBMs, one per modality. We tested several ways of representing the perceptual information about the object (e.g. raw images, FREAK descriptors, shape descriptors combined with size information, sparse vector encoding). The action and the object name were represented as sparse binary vectors.
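The architecture described above can be sketched as follows. This is a minimal illustrative implementation, not the project’s actual model: the layer sizes, the random binary training data, and single-step contrastive divergence (CD-1) training are all assumptions made for the sake of the example.

```python
import numpy as np

class RBM:
    """Bernoulli-Bernoulli RBM trained with one-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.01, (n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)

    @staticmethod
    def _sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def hidden_probs(self, v):
        return self._sigmoid(v @ self.W + self.b_h)

    def visible_probs(self, h):
        return self._sigmoid(h @ self.W.T + self.b_v)

    def cd1_step(self, v0, rng, lr=0.05):
        h0 = self.hidden_probs(v0)
        h0_sample = (rng.random(h0.shape) < h0).astype(float)
        v1 = self.visible_probs(h0_sample)
        h1 = self.hidden_probs(v1)
        # Move weights toward the data statistics, away from the reconstructions.
        self.W += lr * (v0.T @ h0 - v1.T @ h1) / len(v0)
        self.b_v += lr * (v0 - v1).mean(axis=0)
        self.b_h += lr * (h0 - h1).mean(axis=0)

# Hypothetical inputs: visual features, sparse binary action and name vectors.
rng = np.random.default_rng(42)
vision = (rng.random((200, 30)) < 0.3).astype(float)
action = (rng.random((200, 10)) < 0.2).astype(float)
name   = (rng.random((200, 12)) < 0.2).astype(float)

# One RBM per modality.
rbm_v, rbm_a, rbm_n = RBM(30, 16, 1), RBM(10, 8, 2), RBM(12, 8, 3)
for _ in range(20):
    rbm_v.cd1_step(vision, rng)
    rbm_a.cd1_step(action, rng)
    rbm_n.cd1_step(name, rng)

# Associative layer: an RBM over the concatenated hidden representations.
h = np.hstack([rbm_v.hidden_probs(vision),
               rbm_a.hidden_probs(action),
               rbm_n.hidden_probs(name)])
assoc = RBM(h.shape[1], 24, 4)
for _ in range(20):
    assoc.cd1_step(h, rng)

top = assoc.hidden_probs(h)
print(top.shape)  # (200, 24)
```

After training, patterns recalled through the associative layer couple the three modalities, so presenting one modality (e.g. the name) can reactivate the others.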
The results of our simulations showed that:
1. Language development has an impact on action. The computational model produces scale errors only after being trained with perception, action and language as input data, not after being trained with perception and action alone.
2. Experience with objects matters. The computational model produces scale errors only after being trained with unevenly distributed data (e.g. 80% big objects and 20% small objects).
3. Scale errors are invertible. Empirical studies have shown that scale errors occur when an action typically selected for big objects is applied to a miniature object. Our experiments with the computational model showed that scale errors may also occur with big objects, that is, an action typically applied to small objects is applied to a big object instead. This prediction, generated by our system, could be tested in empirical studies with children.
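Findings 2 and 3 can be illustrated with a deliberately simple toy model, a Bayesian action selector over a noisy size percept. This sketch is an assumption made for illustration, not the RBM architecture itself: skewing the training distribution toward big objects makes big-object actions intrude on small objects, and inverting the skew inverts the error.

```python
import numpy as np

def action_errors(p_big, n=20000, noise=0.35, seed=0):
    """Choose the 'big-object' or 'small-object' action from a noisy size
    percept combined with a prior reflecting the experienced object sizes."""
    rng = np.random.default_rng(seed)
    is_big = rng.random(n) < p_big
    true_size = np.where(is_big, 1.0, 0.0)
    percept = true_size + rng.normal(0, noise, n)
    # Posterior log-odds for 'big' (equal-variance Gaussian likelihoods):
    # log N(x; 1, s) / N(x; 0, s) = (x - 0.5) / s^2, plus the prior log-odds.
    log_prior = np.log(p_big / (1 - p_big))
    log_like = (percept - 0.5) / noise**2
    choose_big = (log_like + log_prior) > 0
    err_on_small = np.mean(choose_big[~is_big])   # big-object action on small object
    err_on_big = np.mean(~choose_big[is_big])     # small-object action on big object
    return err_on_small, err_on_big

# Experience dominated by big objects: the classic scale error dominates.
e_small_80, e_big_80 = action_errors(p_big=0.8)
# Inverted experience: the inverted scale error dominates instead.
e_small_20, e_big_20 = action_errors(p_big=0.2)
print(e_small_80 > e_big_80, e_small_20 < e_big_20)  # True True
```

The design choice here is that experience enters only through the prior: the same perceptual noise produces opposite error patterns depending on which object size dominated training.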
We ran a principal component analysis (PCA) on the associative layer to observe how its activations change over development. The results showed that the activations for small and big objects overlap during a short period of development. This overlap sometimes leads to the selection of an action associated with big objects even though a small object is presented to the network, or, conversely, to the selection of an action associated with small objects even though a big object is presented.
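A minimal sketch of such an analysis, assuming synthetic activation vectors in place of the model’s real associative-layer activations (the dimensionality and cluster means here are invented for illustration):

```python
import numpy as np

def pca_2d(X):
    """Project activation vectors onto the first two principal components."""
    Xc = X - X.mean(axis=0)
    # SVD of the centred data; rows of Vt are the principal axes.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:2].T

# Hypothetical associative-layer activations for small vs. big objects,
# drawn so the two clusters partially overlap (the mid-development regime).
rng = np.random.default_rng(0)
small = rng.normal(0.0, 1.0, (100, 24))
big   = rng.normal(1.0, 1.0, (100, 24))
proj = pca_2d(np.vstack([small, big]))
print(proj.shape)  # (200, 2)
```

Plotting `proj` coloured by object size would show the overlap region where an activation pattern is ambiguous between the two action classes, which is where the misselections described above arise.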
This research contributes to our knowledge of the cross-talk between language and motor structures, suggesting a possible developmental mechanism. The multimodal architecture endows a robot with the ability to comprehend an object’s name, i.e. to recall the perceptual features and possible interactions associated with the object. A robot endowed with this architecture can also name an object based on its perceptual features, or based on the possible interactions with that object.
The following deliverables of the project are of direct interest to the international scientific community engaged in research on cognitive embodiment, specifically on language embodiment in robotics, as well as in neuroscience and psychology:
1. A theoretical model of the influence of language development on overt motor behaviour in children (i.e. scale errors) based on empirical findings, and the definition of cognitively inspired engineering principles for action and language learning and integration in robots.
2. A developmental model of the scale error phenomenon in children based on multimodal deep learning, and its experimental testing within the developmental robotics approach.
3. Implications for a robotic system able to comprehend and produce object names, to produce the action associated with an object when presented with visual or auditory input, or to find the appropriate object for a selected action.