Exploiting semAntic and social knowledGe for visuaL rEcognition

Final Report Summary - EAGLE (Exploiting semAntic and social knowledGe for visuaL rEcognition)

"EAGLE: Exploiting semAntic and social knowledGe for visuaL rEcognition" aimed to exploit big data for large-scale visual recognition problems. Web-based media sharing services like Flickr and YouTube, and social networks such as Facebook, have become increasingly popular, allowing people to easily upload, share and annotate personal photos and videos. However, although visual data are the largest component of the digital world, we still lack effective computational methods for making sense of all this data. Therefore, the main focus of this research project was on designing learning algorithms that make the most effective use of prior and contextual knowledge in the presence of such noisy information.
Specifically, the key idea of the project was to introduce a novel framework in which (i) the visual appearance of a concept, (ii) its semantics, and (iii) its social manifestation are represented in a coherent manner. In recent years we have observed amazing progress in image classification, mostly driven by deep learning and huge manually annotated datasets such as ImageNet. However, these results are usually obtained in a fully supervised setting. In this project, we presented new methodologies for taking advantage of the multiple contextual sources of "social knowledge" available on the web, such as connections between users (e.g. Flickr groups), the semantics encoded in noisy user-generated tags or descriptions, and the implicit hierarchy of visual concepts given by the visual similarity among concepts with similar semantic content. To this end, the main technical challenge faced by our work was to design models that are able to share and transfer this noisy prior knowledge to new examples.

- Description of the work performed since the beginning of the project:

As previously reported, the main focus of the EAGLE project was on designing and developing learning algorithms that make the most effective use of prior and contextual knowledge in the presence of sparse and noisy labels. To this end, we needed data to train our models and a benchmark to validate our results. The NUS-WIDE dataset, a large collection of approx. 260,000 Flickr images, was selected in the early stages of the project as the main experimental testbed. The main reason for this choice is that it was the largest and most popular benchmark in the field that also provides ground-truth data to validate the output of the proposed recognition models. Subsequently, we decided to conduct additional experiments (and collect additional data) in the video domain because the topic was particularly timely and we believed that videos constitute a very good scenario to validate the robustness of the proposed approach. The key assumption of our core recognition method is that images that are difficult to recognize on their own may become clearer in the context of a neighborhood of related images with similar social-network metadata, such as user-generated tags or Flickr groups. Thus, the proposed model uses image metadata nonparametrically to generate neighborhoods of related images using semantic similarity, and then uses a deep neural network to blend visual information from the image and its neighbors. This original model was further extended by using additional prior knowledge provided by external taxonomies. A complementary model was also proposed in order to test a different strategy that takes advantage of visual, semantic and social knowledge together. Here the proposed idea was to learn a joint embedding (using a CCA-based representation) to project the different views of the same entity - i.e. the visual features extracted from the image and, for example, its user tags - into a common semantic space.
Both models were evaluated on common benchmarks (such as NUS-WIDE and other popular datasets for image annotation) in order to compare the results with previous work.
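The neighborhood-generation step described above can be illustrated with a minimal sketch. It assumes that each image carries a set of user tags and that "semantic similarity" is measured with the Jaccard index; both the similarity measure and the function names (jaccard, nearest_neighbors) are illustrative assumptions, not the project's actual implementation.

```python
def jaccard(a, b):
    """Jaccard similarity between two tag sets (0.0 when both are empty)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def nearest_neighbors(query_tags, corpus_tags, k=3):
    """Return indices of the k corpus images whose tags best match the query."""
    scored = [(jaccard(query_tags, tags), idx) for idx, tags in enumerate(corpus_tags)]
    # Sort by decreasing similarity; break ties by index for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy corpus of tag sets (hypothetical data, for illustration only).
corpus = [
    {"beach", "sunset", "sea"},
    {"dog", "park"},
    {"sea", "boat", "sunset"},
    {"city", "night"},
]
neighbors = nearest_neighbors({"sunset", "sea"}, corpus, k=2)
```

Because the neighborhood is computed directly from metadata at query time, no parameters have to be learned for this step, which is what "nonparametric" refers to here.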
Recent advances in deep learning (in terms of both visual recognition performance and the computing power provided by GPUs), together with the significant results obtained in the early stages of the project, also allowed us to conduct large-scale visual recognition experiments in the video domain. Videos are the "dark matter" of the web. YouTube alone has over a billion users, and every day people watch hundreds of millions of hours of video and generate billions of views. In the last period of the project, we started to investigate the applicability of our key idea of sharing and transferring knowledge from noisy source data in a cross-domain scenario, for different video recognition tasks. In a first application, we showed that the visual representation extracted from images labeled with similar tags can be used to localize a specific concept in time. Moreover, we demonstrated that our models learned from large-scale collections of noisy images can be used in a fully webly-supervised learning scenario for action recognition in YouTube videos. In a second application, we showed that the idea of sharing contextual semantic information of the scene can also be used effectively to drive visual tasks such as motion understanding and prediction in videos. These last results open up a fascinating perspective for a large spectrum of applications in which it is very hard to collect sufficient training data to learn good prediction models.
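As a rough illustration of the first application, localizing a concept in time, the sketch below scores each video frame by its best cosine similarity to features of images carrying the tag, then groups consecutive high-scoring frames into segments. The scoring rule, the threshold value and the function names are assumptions made for illustration; the report does not specify how the published method computes its scores.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def localize_tag(frame_feats, tagged_image_feats, threshold=0.8):
    """Return inclusive (start, end) frame-index pairs where the tag seems present."""
    # A frame's score is its best match among the images carrying the tag.
    scores = [max(cosine(f, g) for g in tagged_image_feats) for f in frame_feats]
    segments, start = [], None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i
        elif s < threshold and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(scores) - 1))
    return segments


# Hypothetical 2-D features: tagged web images and five video frames.
tagged_images = [[1.0, 0.0], [0.9, 0.1]]
frames = [[0.0, 1.0], [1.0, 0.1], [0.8, 0.0], [0.1, 1.0], [1.0, 0.0]]
segments = localize_tag(frames, tagged_images)
```

In this toy example, frames 1 to 2 and frame 4 resemble the tagged images, so they form the two returned segments.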

- Main results achieved by the project:

The main results achieved by the EAGLE project can be summarized in three main points.
(1) First, this project required an in-depth analysis of the vast body of prior work in the area of image tagging, which led to a publication in the prestigious ACM Computing Surveys [CSUR16]. As a result of this work, we also organized tutorials at top-tier international conferences, such as ACM Multimedia 2015 and CVPR 2016, to promote our findings and our novel experimental testbed.
(2) Second, our core image recognition model - originally presented at the Int’l Conference on Computer Vision 2015 (the premier conference in Computer Vision) [ICCV15] - introduced a new paradigm for multi-label image classification which obtained state-of-the-art results on the popular NUS-WIDE benchmark. Here we built on the recent success of deep learning by using CNN features pre-trained on ImageNet, and we designed a two-layer neural network that blends the visual information provided by a neighborhood of images sharing similar semantic content. This neighborhood was obtained by measuring similarity across multiple noisy metadata sources, such as tags and Flickr groups. We also extended this idea by introducing a novel mechanism to reduce label noise using external taxonomies, such as WordNet, and developed an alternative model based on a joint visual-textual embedding [PR17].
(3) Third, video understanding tasks such as human action recognition and behaviour analysis have been recognized as one of the main open challenges in computer vision. Therefore, in the final part of the project we started to investigate the applicability of our key idea of transferring noisy source data in a cross-domain scenario, for different video recognition tasks. To this end, we designed a transfer learning approach that shares the visual representation extracted from images labeled with the same tag as the test video in order to localize the concept in time. This was the key idea of our data-driven approaches for tag localization in web videos and for webly-supervised action recognition, published in Computer Vision and Image Understanding [CVIU15, CVIU17]. Additionally, we showed that the idea of sharing contextual semantic information of the scene can be used effectively to drive visual tasks such as motion understanding and prediction [ECCV16].
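The two-layer blending network mentioned in point (2) can be sketched as follows. This is a minimal forward pass under stated assumptions: neighbor features are combined by element-wise max-pooling, weights are random rather than trained, at least one neighbor is supplied, and all names (blend_and_score, random_params, the parameter keys) are hypothetical. It is not the published implementation.

```python
import math
import random


def linear(x, weights, bias):
    """Affine map: one output per row of `weights`."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def relu(x):
    return [max(0.0, v) for v in x]


def blend_and_score(image_feat, neighbor_feats, params):
    """Hidden layer per image, max-pool over neighbors, then per-label scores."""
    h_img = relu(linear(image_feat, params["W_img"], params["b_img"]))
    h_nbrs = [relu(linear(f, params["W_nbr"], params["b_nbr"])) for f in neighbor_feats]
    # Element-wise max over neighbors makes the result independent of their order.
    pooled = [max(col) for col in zip(*h_nbrs)]
    logits = linear(h_img + pooled, params["W_out"], params["b_out"])
    # Independent sigmoids, since the task is multi-label.
    return [1.0 / (1.0 + math.exp(-z)) for z in logits]


def random_params(feat_dim, hidden_dim, num_labels, seed=0):
    """Untrained toy weights; a real model would learn these from data."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "W_img": mat(hidden_dim, feat_dim), "b_img": [0.0] * hidden_dim,
        "W_nbr": mat(hidden_dim, feat_dim), "b_nbr": [0.0] * hidden_dim,
        "W_out": mat(num_labels, 2 * hidden_dim), "b_out": [0.0] * num_labels,
    }


params = random_params(feat_dim=4, hidden_dim=3, num_labels=2)
image = [0.2, 0.1, 0.0, 0.5]                            # stand-in for CNN features
neighbors = [[0.3, 0.0, 0.1, 0.4], [0.0, 0.6, 0.2, 0.1]]
scores = blend_and_score(image, neighbors, params)
```

Pooling over the neighborhood keeps the network's input size fixed regardless of how many neighbors are retrieved, and the per-label sigmoid outputs allow several labels to be active for the same image.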

- Representative publications:
[CVIU17] C. Rupprecht, A. Kapil, N. Liu, L. Ballan, and F. Tombari, "Learning without Prejudice: Avoiding Bias in Webly-Supervised Action Recognition", Computer Vision and Image Understanding, in press, 2017 (IF: 2.498)
[PR17] T. Uricchio, L. Ballan, L. Seidenari, and A. Del Bimbo, "Automatic Image Annotation via Label Transfer in the Semantic Space", Pattern Recognition, vol. 71, pp. 144-157, 2017 (IF: 3.399)
[ECCV16] L. Ballan, F. Castaldo, A. Alahi, F. Palmieri, and S. Savarese, "Knowledge Transfer for Scene-specific Motion Prediction", Proc. of European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016
[CSUR16] X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. M. Snoek, and A. Del Bimbo, "Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval", ACM Computing Surveys, vol. 49, iss. 1, pp. 14:1-14:39, 2016 (IF: 4.043)
[ICCV15] J. Johnson*, L. Ballan*, and L. Fei-Fei, "Love Thy Neighbors: Image Annotation by Exploiting Image Metadata", Proc. of IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015 (* equal contribution)
[CVIU15] L. Ballan, M. Bertini, G. Serra, and A. Del Bimbo, "A Data-Driven Approach for Tag Refinement and Localization in Web Videos", Computer Vision and Image Understanding, vol. 140, pp. 58-67, 2015 (IF: 1.540)

- Contact:
More information about the project, the main results achieved (publications, publicly available software and data, etc.) and related dissemination activities is available on the project website. The EAGLE research activities and results were also regularly disseminated via Dr. Ballan's professional Twitter account (lambertoballan).