Final Report Summary - COGNIMUND (Cognitive Image Understanding: Image representations and Multimodal learning)
1. Novel image/video representations
Good representations incorporate the intrinsic structure of images. They should already go a long way towards removing irrelevant sources of variability while capturing the essence of the visual content. Depending on the task at hand (e.g. image classification, object recognition, pose estimation, or action recognition), different representations may be optimal. We therefore developed a different representation for each of these tasks, each time pushing the state of the art. Here, we describe only a few of them.
For image classification, we built on pattern mining techniques to construct a mid-level representation that combines low-level features in a principled manner, and showed that classification performance can be improved considerably.
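To make the general idea concrete (this is a minimal illustration, not the project's actual pipeline), the sketch below treats each image as a "transaction" of quantized visual words, mines frequently co-occurring pairs of words, and uses the mined patterns as mid-level features. The restriction to pairs, the support threshold, and the toy data are all simplifying assumptions.

```python
import numpy as np
from itertools import combinations

def mine_frequent_pairs(transactions, min_support=0.5):
    """Find pairs of visual words that co-occur in at least `min_support`
    fraction of the images (a toy stand-in for a full frequent-itemset miner)."""
    n_images, n_words = transactions.shape
    patterns = []
    for i, j in combinations(range(n_words), 2):
        support = np.mean(transactions[:, i] & transactions[:, j])
        if support >= min_support:
            patterns.append((i, j))
    return patterns

def encode_with_patterns(transactions, patterns):
    """Mid-level representation: one binary dimension per mined pattern,
    set to 1 if the image contains all visual words of that pattern."""
    feats = np.zeros((transactions.shape[0], len(patterns)), dtype=np.uint8)
    for k, (i, j) in enumerate(patterns):
        feats[:, k] = transactions[:, i] & transactions[:, j]
    return feats

# Toy example: 6 images, 5 quantized visual words (1 = word present).
rng = np.random.default_rng(0)
transactions = rng.integers(0, 2, size=(6, 5)).astype(np.uint8)
patterns = mine_frequent_pairs(transactions)
midlevel = encode_with_patterns(transactions, patterns)
print(patterns, midlevel.shape)
```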
We also studied the interaction between different object instances. Rather than detecting them one by one, we considered them jointly. By reasoning about their relative spatial configurations, using statistical relational learning, false detections can be removed, missed detections recovered, and the pose of each object determined more accurately.
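The following is a minimal sketch of joint rescoring, assuming a hypothetical pairwise class-compatibility table `compat` and a simple overlap-based spatial weighting; it does not reproduce the project's statistical relational learning model.

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def joint_rescore(detections, compat, alpha=0.5):
    """Rescore each detection by the spatially weighted support it receives
    from the others; compat[ci][cj] encodes how plausible it is to see class
    ci near class cj (negative = unlikely). `detections` is a list of
    (box, class_id, score) tuples."""
    new_scores = []
    for i, (box_i, ci, s_i) in enumerate(detections):
        support = 0.0
        for j, (box_j, cj, s_j) in enumerate(detections):
            if i == j:
                continue
            support += compat[ci][cj] * s_j * iou(box_i, box_j)
        new_scores.append(s_i + alpha * support)
    return new_scores

# Toy usage: two overlapping detections, classes 0 ("sofa") and 1 ("tv"),
# which the hypothetical compatibility table marks as mutually exclusive.
compat = np.array([[0.0, -1.0], [-1.0, 0.0]])
dets = [((0, 0, 10, 10), 0, 0.9), ((1, 1, 9, 9), 1, 0.8)]
print(joint_rescore(dets, compat))
```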
Further, we developed a new representation for action recognition. Modeling the temporal evolution in a video has challenged researchers for over a decade. To tackle this problem, we focused on the relative temporal ordering of frames: we trained a linear ranking machine to chronologically order the frames of a video and used its parameters as a new representation, with very convincing results.
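The sketch below illustrates the idea with a simplified, regression-based stand-in for the ranking machine: the weights of a linear model trained to predict frame order serve as the video descriptor. The smoothing step, the regularisation, and the least-squares solver are assumptions, not the exact formulation used in the project.

```python
import numpy as np

def rank_pool(frame_features, lam=1.0):
    """Simplified rank pooling: fit a linear model that maps each frame's
    feature vector to its temporal index, and return the learned weights
    as the video-level descriptor.
    frame_features: (n_frames, dim) array, assumed in temporal order."""
    n_frames, dim = frame_features.shape
    # Running-mean smoothing of the frame features over time.
    smoothed = np.cumsum(frame_features, axis=0) / \
        np.arange(1, n_frames + 1)[:, None]
    t = np.arange(1, n_frames + 1, dtype=float)        # target: frame order
    # Ridge-regularised least squares as a stand-in for the ranking machine.
    A = smoothed.T @ smoothed + lam * np.eye(dim)
    w = np.linalg.solve(A, smoothed.T @ t)
    return w                                           # video descriptor

# Toy usage: 30 frames with 128-dimensional per-frame features.
video = np.random.default_rng(0).normal(size=(30, 128))
descriptor = rank_pool(video)
print(descriptor.shape)   # (128,)
```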
2. Weakly supervised methods to learn from multimodal input
For genuine cognitive-level image understanding, methods need to scale to hundreds or thousands of object classes. Large datasets with class labels are now available, on which such classifiers can be trained.
For object detection, on top of the class label, a 2D bounding box is required for each training image. To relax this constraint, we studied weakly supervised methods. Exploiting similarity across different training images, we jointly learned the bounding boxes and the corresponding model. Extra cues can then be integrated, such as mirror symmetry (i.e. a mirrored image of a car still contains a car) or mutual exclusion (e.g. a bounding box can show either a sofa or a television, but not both).
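For illustration, a generic multiple-instance-style alternation between window selection and classifier training might look as follows; the bag/feature structure, the use of scikit-learn's LinearSVC, and the random toy data are assumptions, and the mirror-symmetry and mutual-exclusion cues are omitted for brevity.

```python
import numpy as np
from sklearn.svm import LinearSVC

def weakly_supervised_select(pos_bags, neg_feats, n_iters=5):
    """Toy multiple-instance loop: each positive image is a 'bag' of
    candidate-window features (n_windows, dim) known only to contain the
    object somewhere; neg_feats holds windows from images without the
    object. Alternates between picking the best window per bag and
    retraining the detector on the selected windows."""
    # Initialise each bag's selected window with its mean feature vector.
    selected = [bag.mean(axis=0) for bag in pos_bags]
    clf = None
    for _ in range(n_iters):
        X = np.vstack([np.vstack(selected), neg_feats])
        y = np.hstack([np.ones(len(selected)), np.zeros(len(neg_feats))])
        clf = LinearSVC(C=1.0).fit(X, y)
        # Re-select: the window the current model scores highest in each bag.
        selected = [bag[np.argmax(clf.decision_function(bag))]
                    for bag in pos_bags]
    return clf, selected

# Toy usage: 4 positive bags of 10 candidate windows each, 40 negatives.
rng = np.random.default_rng(0)
pos_bags = [rng.normal(size=(10, 32)) for _ in range(4)]
neg_feats = rng.normal(size=(40, 32))
clf, windows = weakly_supervised_select(pos_bags, neg_feats)
```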
Learning purely from images and naturally co-occurring text, without any external data, is actually quite difficult, given the high complementarity and limited redundancy between the two modalities. We solved this problem for specific settings, such as naming faces based on image captions or identifying locations based on the scripts of soap series. More interestingly, we developed a scheme for image retrieval that exploits this complementarity, taking as input an example image and a textual modifier.
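As a purely hypothetical sketch of how such a retrieval scheme could operate, the code below composes the example-image embedding and the textual-modifier embedding additively and ranks a gallery by cosine similarity; the project's actual composition function is not described in this summary.

```python
import numpy as np

def compose_query(image_emb, text_emb, alpha=0.5):
    """Hypothetical composition of the example image embedding and the
    textual modifier embedding into a single query vector."""
    q = image_emb + alpha * text_emb
    return q / (np.linalg.norm(q) + 1e-9)

def retrieve(query, gallery_embs, top_k=5):
    """Rank gallery images by cosine similarity to the composed query."""
    gallery = gallery_embs / (np.linalg.norm(gallery_embs, axis=1,
                                             keepdims=True) + 1e-9)
    scores = gallery @ query
    return np.argsort(-scores)[:top_k]

# Toy usage with random 64-dimensional embeddings.
rng = np.random.default_rng(0)
img, txt = rng.normal(size=64), rng.normal(size=64)
gallery = rng.normal(size=(100, 64))
print(retrieve(compose_query(img, txt), gallery))
```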
Another problem we looked at is domain shift. Good results are often obtained when training a model on a given dataset and evaluating it on a different part of that same dataset; however, results may be much worse when evaluating on new data (e.g. your own photo collection). To overcome this problem, we proposed a method for domain adaptation that transforms the representation so as to bring the training (source) data closer to the test (target) data distribution. When text data is available for the target domain, e.g. in the form of subtitles, it can additionally be taken into account during the domain adaptation process.
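As one concrete instance of such a feature transformation, the sketch below follows the well-known subspace-alignment recipe (rotate the PCA subspace of the source data towards that of the target before projecting); whether this matches the project's exact method is not stated in this summary, and the dimensionality and toy data are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

def subspace_align(source_X, target_X, n_components=20):
    """Subspace-alignment-style adaptation: learn PCA subspaces for source
    and target, align the source basis with the target basis, and project
    both sets of features accordingly."""
    pca_s = PCA(n_components=n_components).fit(source_X)
    pca_t = PCA(n_components=n_components).fit(target_X)
    Xs = pca_s.components_.T            # (dim, k) source basis
    Xt = pca_t.components_.T            # (dim, k) target basis
    M = Xs.T @ Xt                       # alignment matrix
    source_aligned = source_X @ Xs @ M  # source projected into aligned space
    target_proj = target_X @ Xt         # target in its own subspace
    return source_aligned, target_proj

# Toy usage: train a classifier on source_aligned, evaluate on target_proj.
rng = np.random.default_rng(0)
source = rng.normal(size=(200, 64))
target = rng.normal(size=(150, 64)) + 0.5   # shifted target distribution
src_a, tgt_p = subspace_align(source, target)
print(src_a.shape, tgt_p.shape)             # (200, 20) (150, 20)
```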
Finally, we also looked at exploiting 3D CAD models to improve object recognition.