
HELIOS Report Summary

Project ID: 321162
Funded under: FP7-IDEAS-ERC
Country: United Kingdom

Mid-Term Report Summary - HELIOS (Towards Total Scene Understanding using Structured Models)

The HELIOS project was designed with an ambitious goal in mind: working towards providing computers with a comprehensive understanding of a scene by building on Torr Vision Group's significant body of existing work on segmenting images of scenes using structured models such as conditional random fields (CRFs). As outlined in the original proposal, the project encompasses a number of different research areas: (i) semantic segmentation, which focuses on segmenting images and identifying the objects within them; (ii) semantic reconstruction, which involves the reconstruction of a 3D model of the scene and the identification of the objects within it in 3D; (iii) action recognition, which focuses on detecting and describing actions that are taking place in a video sequence; and (iv) natural language description of scenes, which focuses on describing what can be seen in an image to a human user (a topic that has significant potential uses in helping people who are partially-sighted).

In the 30 months that have elapsed since the start of the project, we have made significant progress in each of these areas, and we provide a summary of some of the highlights:

(i) In semantic segmentation, we have shown [i.1] how to encode a CRF model as a recurrent neural network (RNN), allowing it to be embedded into a larger neural network that can then be trained end-to-end to yield significant improvements over the previous state-of-the-art results for segmenting images into categories. This work won a best demo award at ICCV 2015 and has attracted a huge amount of interest from the computer vision community: at the time of writing, it already has 198 citations on Google Scholar, and the open-source implementation of the approach that we have released on GitHub has been forked by 227 people. (This makes it currently the 10th most forked Matlab project on the whole of GitHub.) This work has subsequently been extended [i.2] to segment individual objects in an image.
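
To make the core idea of [i.1] concrete, the following is a minimal sketch of mean-field inference for a dense CRF unrolled into a fixed number of recurrent steps, which is the structure that allows the whole model to be trained end-to-end by backpropagation. The full method uses efficient Gaussian filtering rather than an explicit pairwise kernel matrix and learns the kernel and compatibility parameters; the dense kernel and compat arrays below are purely illustrative.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def crf_as_rnn(unary, kernel, compat, num_iters=5):
        # unary  : (N, L) per-pixel class scores from the underlying CNN
        # kernel : (N, N) pairwise weights (e.g. Gaussian in position/colour)
        # compat : (L, L) label-compatibility matrix (e.g. Potts)
        # Each loop iteration is one time step of the recurrent network.
        q = softmax(unary)                 # initialisation from the unaries
        for _ in range(num_iters):
            msg = kernel @ q               # message passing (filtering)
            pairwise = msg @ compat        # compatibility transform
            q = softmax(unary - pairwise)  # add unary terms and normalise
        return q                           # (N, L) per-pixel label marginals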

(ii) In semantic reconstruction, we have shown [ii.1] how to facilitate the segmentation of reconstructed 3D scenes by allowing the user to physically interact with the scene in the real world using touch. Our segmentation framework, SemanticPaint, trains an online random forest classifier based on real-world user interactions and uses it to predict a category-level segmentation of the entire scene. This work was demonstrated live (over a five-day period) at SIGGRAPH 2015 [ii.2] and has attracted significant interest from both other researchers and the popular media [ii.3]. We are currently extending this work to support the reconstruction and detection of individual objects in the scene. The work itself is based on Oxford's popular InfiniTAM reconstruction engine [ii.4], to which we have made significant contributions.
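
As a rough illustration of how the interactive learning component works, the sketch below implements a heavily simplified online random forest: each user touch supplies a labelled voxel feature that updates per-leaf class histograms, and predictions for unlabelled voxels average the leaf distributions across trees. This is only a sketch under simplifying assumptions (fixed random trees, features scaled to [0, 1]); the actual SemanticPaint forest is trained online with the more sophisticated scheme described in [ii.1].

    import numpy as np

    class OnlineForest:
        # Simplified online random forest: trees have fixed random
        # axis-aligned splits; only the per-leaf class histograms are
        # updated as user-labelled (feature, label) pairs stream in.
        def __init__(self, num_trees, depth, feat_dim, num_classes, seed=0):
            rng = np.random.default_rng(seed)
            self.depth = depth
            self.dims = rng.integers(0, feat_dim, size=(num_trees, 2**depth - 1))
            self.thresh = rng.uniform(0.0, 1.0, size=(num_trees, 2**depth - 1))
            # Laplace-smoothed class counts for every leaf of every tree.
            self.hist = np.ones((num_trees, 2**depth, num_classes))

        def _leaf(self, tree, x):
            node = 0                       # route x from the root to a leaf
            for _ in range(self.depth):
                right = x[self.dims[tree, node]] > self.thresh[tree, node]
                node = 2 * node + 1 + int(right)
            return node - (2**self.depth - 1)

        def update(self, x, label):        # one user-labelled voxel feature
            for t in range(self.hist.shape[0]):
                self.hist[t, self._leaf(t, x), label] += 1

        def predict(self, x):              # class distribution for a voxel
            leaves = [self.hist[t, self._leaf(t, x)] for t in range(self.hist.shape[0])]
            probs = np.stack(leaves)
            return (probs / probs.sum(axis=1, keepdims=True)).mean(axis=0)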

We have also done major work on large-scale outdoor semantic reconstruction [ii.5] from stereo imagery, and on interactive outdoor scene labelling [ii.6]. The former work in particular was a finalist for the Robot Vision Best Paper Award at ICRA 2015.

(iii) In action recognition, we have proposed [iii.1] a new approach to the spatio-temporal localisation (detection) and classification of multiple concurrent actions within temporally-untrimmed videos. Our action detection framework takes advantage of the latest improvements in deep learning to provide a frame-by-frame prediction of action locations and categories. A final two-stage optimisation process then connects these frame-level detections into coherent space-time detections called tubes. Our action detection approach surpassed the previous state of the art and, for the first time, we were able to demonstrate the detection of multiple concurrent actions in temporally-untrimmed videos in which a single action may involve several people, such as 'discussion' and 'handshaking'.
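
The tube-building stage can be pictured as a simple dynamic program: given candidate boxes and class scores in each frame, choose one box per frame so as to maximise the sum of class scores plus a weighted overlap between boxes in consecutive frames. The sketch below illustrates this idea; the actual pipeline in [iii.1] solves such a problem per action class and follows it with a second optimisation pass that trims each tube in time, and the overlap weighting lam used here is an assumed, illustrative parameter.

    import numpy as np

    def iou(a, b):
        # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    def link_into_tube(boxes, scores, lam=1.0):
        # boxes[t]  : list of candidate boxes in frame t
        # scores[t] : class score of each candidate in frame t
        # Dynamic programming over frames: best[t][j] is the best total
        # score of any path ending at box j of frame t.
        T = len(boxes)
        best = [np.asarray(scores[0], dtype=float)]
        back = []
        for t in range(1, T):
            cur, ptr = [], []
            for j, bj in enumerate(boxes[t]):
                trans = [best[t - 1][i] + lam * iou(bi, bj)
                         for i, bi in enumerate(boxes[t - 1])]
                k = int(np.argmax(trans))
                ptr.append(k)
                cur.append(scores[t][j] + trans[k])
            best.append(np.asarray(cur))
            back.append(ptr)
        # Trace back the highest-scoring path (one box index per frame).
        path = [int(np.argmax(best[-1]))]
        for ptr in reversed(back):
            path.append(ptr[path[-1]])
        return list(reversed(path))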

(iv) For natural language description of scenes, we have established a collaboration with the Smart Specs group of Dr Stephen Hicks (at the Nuffield Department of Clinical Neurosciences in Oxford). They are developing smart glasses to assist people who are partially-sighted to navigate around and interact with the world more freely. Describing the contents of the scene to a partially-sighted user is a particularly challenging form of the natural language description problem, because the descriptions produced must both take into account the needs of the partially-sighted user and be unobtrusive enough to avoid overwhelming the user's sense of hearing. Our research in this area is still at a relatively early stage, but the glasses themselves are already in the process of being deployed in the real world, providing us with an excellent platform on which to test out our ideas.

[i.1] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su, Dalong Du, Chang Huang, Philip H.S. Torr, Conditional Random Fields as Recurrent Neural Networks, IEEE International Conference on Computer Vision (ICCV), 2015 (poster).

[i.2] Anurag Arnab, Philip H.S. Torr, Bottom-up Instance Segmentation using Deep Higher-Order CRFs, British Machine Vision Conference (BMVC) 2016.

[ii.1] Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, David Kim, Jamie Shotton, Pushmeet Kohli, Matthias Niessner, Antonio Criminisi, Shahram Izadi, Philip H.S. Torr, SemanticPaint: Interactive 3D Labeling and Learning at your Fingertips, ACM Transactions on Graphics, 2015 (oral).

[ii.2] Stuart Golodetz, Michael Sapienza, Julien Valentin, Vibhav Vineet, Ming-Ming Cheng, Victor Adrian Prisacariu, Olaf Kaehler, Carl Yuheng Ren, Anurag Arnab, Stephen Hicks, David W. Murray, Shahram Izadi, Philip H.S. Torr, SemanticPaint: Interactive Segmentation and Learning of 3D Worlds, Proceedings of ACM SIGGRAPH 2015 Emerging Technologies, 2015 (live demo).

[ii.3] See BBC Click (26/09/15, 01/10/15, 26/12/15).

[ii.4] Olaf Kaehler, Victor Adrian Prisacariu, Carl Yuheng Ren, Xin Sun, Philip H.S. Torr, David W. Murray, Very High Frame Rate Volumetric Integration of Depth Images on Mobile Devices, IEEE Transactions on Visualization and Computer Graphics, 2015 (oral).

[ii.5] Vibhav Vineet, Ondrej Miksik, Morten Lidegaard, Matthias Niessner, Stuart Golodetz, Victor Adrian Prisacariu, Olaf Kaehler, David W. Murray, Shahram Izadi, P. Perez, Philip H.S. Torr, Incremental Dense Semantic Stereo Fusion for Large-Scale Semantic Scene Reconstruction, IEEE International Conference on Robotics and Automation (ICRA), 2015 (oral).

[ii.6] Ondrej Miksik, Vibhav Vineet, Morten Lidegaard, Ram Prasaath, Matthias Niessner, Stuart Golodetz, Stephen Hicks, P. Perez, Shahram Izadi, Philip H.S. Torr, The Semantic Paintbrush: Interactive 3D Mapping and Recognition in Large Outdoor Spaces, Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI), 2015 (oral).

[iii.1] Suman Saha, Gurkirt Singh, Michael Sapienza, Philip H.S. Torr, Fabio Cuzzolin, Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos, British Machine Vision Conference (BMVC) 2016.

Reported by

THE CHANCELLOR, MASTERS AND SCHOLARS OF THE UNIVERSITY OF OXFORD
United Kingdom