Visual Recognition

Final Report Summary - VISREC (Visual Recognition)

The goal of this project was to develop the fundamental knowledge needed to design a visual system able to learn, recognize, and retrieve, quickly and accurately, thousands of visual categories, including objects, scenes, human actions, and activities: in effect, a "visual Google" for images and videos, able to search for the "nouns" (objects, scenes), "verbs" (actions/activities), and "adjectives" (materials, patterns) of visual content.

Progress has been made on a number of fronts, including: (i) learning visual models on-the-fly to retrieve semantic entities in large-scale image and video collections starting from a text query - this has enabled visual retrieval of people (from faces), object categories (such as vehicles and animals), and object instances (such as particular buildings or particular paintings); (ii) automatic identification of flower species and sculptures; (iii) methods and models for detecting and localizing object categories in images - in particular, reducing the level of supervision required to train such models; and (iv) deep learning methods for recognizing object categories, text, and human actions and interactions (such as handshakes) in images and videos.
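The on-the-fly idea in point (i) can be illustrated with a minimal sketch: features extracted from images found for a text query serve as positives, a fixed pool of generic images serves as negatives, a linear classifier is trained on the spot, and the database is ranked by classifier score. The feature dimensions, the ridge-regression classifier, and all data below are hypothetical stand-ins, not the project's actual pipeline.

```python
import numpy as np

rng = np.random.default_rng(0)

def train_linear_scorer(pos, neg, l2=1.0):
    """Ridge-regression linear classifier: positives labelled +1, negatives -1."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
    d = X.shape[1]
    # Closed-form ridge solution: (X^T X + l2*I) w = X^T y
    w = np.linalg.solve(X.T @ X + l2 * np.eye(d), X.T @ y)
    return w

# Toy 8-dimensional "image features": query results cluster around +1,
# the fixed negative pool clusters around -1.
pos = rng.normal(+1.0, 0.5, size=(20, 8))    # images returned for the text query
neg = rng.normal(-1.0, 0.5, size=(200, 8))   # generic negative pool
w = train_linear_scorer(pos, neg)

# Rank a small "database" by score; higher score = more relevant to the query.
db = np.vstack([rng.normal(+1.0, 0.5, size=(5, 8)),   # relevant items (indices 0-4)
                rng.normal(-1.0, 0.5, size=(5, 8))])  # irrelevant items (indices 5-9)
ranking = np.argsort(-(db @ w))
print(ranking)
```

Because the classifier is linear, scoring the whole database is a single matrix-vector product, which is what makes training and retrieval feasible at query time over large collections.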

The outcomes of this research will impact any application where visual recognition is useful, and will enable entirely new applications: effortlessly searching and annotating home image and video collections based on their visual content; searching and annotating large commercial image and video archives (e.g. YouTube); and extending the class of images that can be used as queries to the web (in the manner of Google Goggles) in order to identify their visual content.