One of the primary and most appealing goals of computer vision is to automatically understand the content of images on a cognitive level. Ultimately, we want computers to interpret images as humans do, recognizing all the objects, scenes, and people, as well as their relations, as they appear in natural images or video. With this project, I want to advance the state of the art in this field in two directions, which I believe are crucial for building the next generation of image understanding tools. First, novel, more robust yet descriptive image representations will be designed that incorporate the intrinsic structure of images. These should already go a long way towards removing irrelevant sources of variability while capturing the essence of the image content. I believe the importance of further research into image representations is currently underestimated within the research community, yet I claim this is a crucial step with many opportunities: good learning cannot easily make up for bad features. Second, weakly supervised methods to learn from multimodal input (especially the combination of images and text) will be investigated, making it possible to leverage the large amount of weak annotations available via the internet. This is essential if we want to scale the methods to a larger number of object categories (several hundred instead of a few tens). As more data can be used for training, such weakly supervised methods might in the end even come on par with or outperform supervised schemes. Here we will call upon the latest results in semi-supervised learning, data mining, and computational linguistics.