CORDIS - EU research results

The emergence of understanding from the combination of innate mechanisms and visual experience

Final Report Summary - DIGITALBABY (The emergence of understanding from the combination of innate mechanisms and visual experience)

The overall goal of the ‘Digital Baby’ research effort is a computational study of how knowledge of the world emerges from the combination of innate mechanisms and visual experience. By watching its environment and interacting with it, the model develops, on its own, representations of complex concepts that allow it to understand the world around it in terms of objects, object categories, events, agents, actions, goals, social interactions, and the like. We dubbed the model system a ‘digital baby’ because it faces problems similar to those of an infant trying to use her or his experience to understand the world. Research results are listed briefly below in two main parts: ‘Innate structures and learning’, which shows how innate structures guide the learning process, and ‘Learning to understand the world’, which models the evolving understanding of complex visual scenes.

Innate structures and learning

Learning to perceive coherent objects: Adults naturally perceive the surrounding scene as segregated into coherent objects. In contrast, infants initially have a surprisingly impoverished capacity to segregate the scene into objects, and this ability is gradually learned from experience. Our model shows how, starting from simple capacities known to exist at an early age, complex object segregation can be learned in an entirely unsupervised manner by observing videos of objects in motion.

Learning about hands and their interactions with objects: In learning to understand actions and goals, an important part is identifying the agents’ hands, their configuration, and their interactions with objects. Detecting hands, paying attention to what they are doing, and using them to make inferences and predictions are natural for humans and appear early in development. The Digital Baby learns about hands better than any existing model by using an empirically motivated ‘guiding signal’ – that hands are used to move objects around.

Learning to perceive direction of gaze: Infants can detect and follow another person’s gaze, and this skill, which begins to develop at around 3-6 months, plays an important role in the development of communication and language. The model learns on its own to detect people’s direction of gaze using the fact that people look at an object when they make contact with it, e.g. to move it or lift it up.
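The use of contact events as a teaching signal for gaze can be illustrated with a minimal sketch. It is not the project's implementation: it assumes that each detected contact event supplies a face location and a contact location in image coordinates, and that the gaze label is simply the unit vector from the face to the contact point. The names `gaze_label` and `build_training_set`, and the event format, are hypothetical.

```python
import math
from typing import Any, List, Tuple

Point = Tuple[float, float]  # (x, y) in image coordinates


def gaze_label(face_center: Point, contact_point: Point) -> Point:
    """Return the unit vector pointing from the face to the contact point.

    Assumption: at the moment of contact the person is looking at the
    touched object, so this direction can serve as a gaze label.
    """
    dx = contact_point[0] - face_center[0]
    dy = contact_point[1] - face_center[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("face and contact point coincide")
    return (dx / norm, dy / norm)


def build_training_set(
    events: List[Tuple[Any, Point, Point]]
) -> List[Tuple[Any, Point]]:
    """Pair each face image with a gaze label derived from a contact event.

    Each event is a hypothetical tuple (face_image, face_center, contact_point).
    No manual annotation is involved; the labels come from the events alone.
    """
    return [
        (face_image, gaze_label(face_center, contact_point))
        for face_image, face_center, contact_point in events
    ]
```

The resulting (face image, direction) pairs could then be used to train any regressor that maps face appearance to gaze direction; the choice of regressor is left open here.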

The implications of these parts of the model go beyond the specific tasks, by showing how innate guiding signals provide a major general mechanism guiding early learning.

Learning to understand the world

A major part of the Digital Baby project was to develop an increasingly complex understanding of the visual environment, including the representation and classification of objects, actions, and social interactions. This was achieved in the Digital Baby model through the following process.

The recognition of minimal images: A minimal recognizable image is an image patch that can be reliably recognized by human observers, and which is minimal in the sense that further reduction in either size or resolution makes the patch unrecognizable. We have developed methods to identify minimal images and to learn to recognize them efficiently. A crucial advantage of minimal images for complex image understanding is that they can always be recognized reliably on their own, independent of the surrounding context.
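The definition of a minimal image can be expressed as a simple test. The sketch below is illustrative only and makes several assumptions not stated in this summary: the patch is a square grayscale image stored as a list of rows, size reduction is modelled by four corner crops at 80% of the side length, resolution reduction by 2x2 block averaging, and `is_recognized` is a hypothetical callable standing in for the human recognition judgement.

```python
from typing import Callable, List

Patch = List[List[float]]  # square grayscale patch as rows of pixel values


def crop(patch: Patch, top: int, left: int, size: int) -> Patch:
    """Cut a size x size window whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in patch[top:top + size]]


def downsample(patch: Patch, factor: int = 2) -> Patch:
    """Lower the resolution by averaging factor x factor pixel blocks."""
    n = len(patch) // factor
    return [
        [
            sum(patch[r * factor + i][c * factor + j]
                for i in range(factor) for j in range(factor)) / factor ** 2
            for c in range(n)
        ]
        for r in range(n)
    ]


def reduced_versions(patch: Patch, shrink: float = 0.8) -> List[Patch]:
    """Four corner crops (smaller size) plus one lower-resolution copy."""
    size = len(patch)
    small = max(1, int(size * shrink))
    offset = size - small
    corners = [(0, 0), (0, offset), (offset, 0), (offset, offset)]
    return [crop(patch, t, l, small) for t, l in corners] + [downsample(patch)]


def is_minimal(patch: Patch, is_recognized: Callable[[Patch], bool]) -> bool:
    """A patch is minimal if it is recognized but no reduced version is."""
    if not is_recognized(patch):
        return False
    return not any(is_recognized(p) for p in reduced_versions(patch))
```

With a toy recognizer that accepts any patch at least 10 pixels wide, a 10x10 patch passes the test, whereas a 20x20 patch does not, because its 16x16 crops are still recognized.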

Full image interpretation: We next developed a model that uses the initial recognition of minimal images to obtain a full and detailed interpretation of their internal structure. For instance, the initial phase may identify ‘a man’s torso’, and the full interpretation stage will then identify all the visible components within this local image, such as neck, shirt, suit, collar, pocket, tie, and more.

Applications to objects, actions, and social interactions: Following the interpretation at the level of minimal images, the model expands the interpretation process to surrounding regions, which are initially ambiguous on their own. For example, the process may recognize the action of drinking from a cup, by first recognizing a face image, using internal interpretation to identify the agent's mouth, and proceeding to identify the object docked at the mouth as a cup, even when the cup is unrecognizable on its own. We showed how this evolving process can be naturally used for recognizing complex actions and social interactions.

In summary, the Digital Baby combines a computational model with evidence from human cognition and learning, to show how the ‘emergence of understanding’ can be obtained from the combination of innate mechanisms and visual experience. The model leads to a better understanding of learning processes by humans, and provides novel methods for learning by intelligent systems.