Synchronous Linguistic and Visual Processing

Final Report Summary - SYNPROC (Synchronous Linguistic and Visual Processing)

When humans process language, they rarely do so in isolation. Linguistic input often occurs synchronously with visual input, e.g. in everyday activities such as attending a lecture or following directions on a map. The visual context constrains the interpretation of the linguistic input, and vice versa, making processing more efficient and less ambiguous.

The SynProc project studied synchronous linguistic and visual processing by tracking participants' eye movements while they viewed a naturalistic scene and listened to speech at the same time. The project investigated how synchronous processing is influenced by factors such as saliency (visual prominence in a scene), referential ambiguity (one word corresponding to multiple objects), and context (the set of objects that occur in a scene). Our results indicate that visual saliency is used by the language comprehension system to make predictions (e.g. about which argument a verb takes). Similarly, properties of the context (e.g. how cluttered the scene is and whether it contains animate objects) are used by the language production system to decide which words to choose when formulating a sentence. A striking experimental result is that what a speaker will say can be predicted with some accuracy from their scan pattern on a visual scene, i.e. from the sequence of objects that they fixate, even before they start to speak.

SynProc's experimental results fed into a series of computational models that predict the eye-movement patterns that humans follow when they view a scene and comprehend or produce speech at the same time. Our key modeling idea is to treat synchronous processing as an alignment problem, for which a rich literature exists. The project developed a baseline alignment model based on a maximum entropy classifier. Such a model cannot do justice to the hierarchical (tree-like) structure of language; we therefore developed a hierarchical alignment model by drawing on techniques from synchronous parsing. The model aligns visual objects and linguistic units in the same way that the phrases of two languages are aligned in machine translation. This model was successfully used to simulate a range of language/vision tasks, including image parsing, image description, and image retrieval.
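
To make the baseline idea concrete, the following is a minimal sketch of a maximum entropy (logistic regression) classifier that decides whether a word and a visual object should be aligned. It is an illustration only, not the project's model: the feature set (`features`), the toy training pairs, and all function names are assumptions invented for this example.

```python
import math

def features(word, obj_label, fixated):
    """Hypothetical binary features for a (word, object) pair."""
    return [
        1.0,                                   # bias term
        1.0 if word == obj_label else 0.0,     # word matches the object's label
        1.0 if fixated else 0.0,               # object was fixated near word onset
    ]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def p_aligned(weights, x):
    """Probability that the pair described by feature vector x is aligned."""
    return sigmoid(sum(w * xi for w, xi in zip(weights, x)))

def train(examples, epochs=500, lr=0.5):
    """Fit weights by batch gradient ascent on the log-likelihood."""
    weights = [0.0] * len(examples[0][0])
    for _ in range(epochs):
        grad = [0.0] * len(weights)
        for x, y in examples:
            error = y - p_aligned(weights, x)
            for i, xi in enumerate(x):
                grad[i] += error * xi
        weights = [w + lr * g / len(examples) for w, g in zip(weights, grad)]
    return weights

# Invented toy data: (feature vector, 1 if aligned else 0).
data = [
    (features("dog", "dog", True), 1),
    (features("dog", "dog", False), 1),
    (features("dog", "ball", True), 0),
    (features("dog", "ball", False), 0),
    (features("ball", "ball", True), 1),
    (features("ball", "dog", False), 0),
]
weights = train(data)
```

A flat classifier of this kind scores each word-object pair independently, which is why it cannot capture the tree-like structure of a sentence; that limitation is what motivates the hierarchical model described above.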

In the course of the project we realized that eye-tracking data can also be applied to core computer vision problems, such as object detection. We therefore collected a large set of images together with bounding boxes marking the objects, and eye-movement data produced by humans in a visual search task. Using this data set, we were able to train an object classifier and achieve performance on a par with models using bounding boxes (assuming equal annotation time). This approach has the potential to scale to more complicated computer vision problems (e.g. action recognition), dramatically reducing the time required for annotation.
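
As a rough illustration of how fixations might stand in for manual annotation, the sketch below derives a box from a handful of fixation points and compares it with an annotated bounding box using intersection over union (IoU). This is an assumption-laden example, not the procedure used in the project: the padding heuristic in `box_from_fixations`, the `margin` value, and the coordinates are all hypothetical.

```python
def box_from_fixations(fixations, margin=20.0):
    """Hypothetical heuristic: pad the extent of the fixation points by a margin.

    Returns a box as (x_min, y_min, x_max, y_max) in pixels.
    """
    xs = [x for x, _ in fixations]
    ys = [y for _, y in fixations]
    return (min(xs) - margin, min(ys) - margin, max(xs) + margin, max(ys) + margin)

def area(box):
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])

def iou(a, b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    inter = area(overlap)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

# Invented example values: three fixations and a manually drawn box.
fixations = [(110.0, 95.0), (130.0, 120.0), (150.0, 105.0)]
annotated = (100.0, 80.0, 160.0, 140.0)

estimated = box_from_fixations(fixations)
overlap_score = iou(estimated, annotated)
```

A higher `overlap_score` means the fixation-derived box agrees more closely with the manual annotation.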