Community Research and Development Information Service - CORDIS

Periodic Report Summary 1 - EAGLE (Exploiting semAntic and social knowledGe for visuaL rEcognition)

- Summary description of the project objectives:

"EAGLE: Exploiting semAntic and social knowledGe for visuaL rEcognition" aims to exploit big data for large-scale visual recognition problems. Web-based media sharing services like Flickr and YouTube, and social networks such as Facebook, have become more and more popular, allowing people to easily upload, share and annotate personal photos and videos. However, although visual data are the largest component of the digital world, we still lack effective computational methods for making sense of all this data. The main focus of this research project is on designing learning algorithms that make the most effective use of prior and contextual knowledge in presence of such noisy information.
Specifically, the idea of the project is to introduce a novel framework in which (i) the visual appearance of a concept, (ii) its semantics, and (iii) its “social manifestation” are represented in a coherent manner. In recent years we have observed remarkable progress in image classification, mostly driven by deep learning and huge manually annotated datasets such as ImageNet. However, these results are obtained in a fully supervised setting. In this project, we investigate new methodologies to take advantage of the multiple contextual sources of "social knowledge" available on the web, such as the connections between users (e.g. Flickr groups), the semantics encoded in noisy user-generated tags or image descriptions, and the implicit hierarchy of visual concepts given by the visual similarity among concepts with similar semantic content. To this end, the main technical challenge is to design models that are able to share and transfer this noisy prior knowledge to new test examples.

- Description of the work performed since the beginning of the project:

The main focus of the EAGLE project is on designing learning algorithms that make the most effective use of prior and contextual knowledge in the presence of sparse and noisy labels. The first quarter of the outgoing phase was mostly devoted to the analysis of prior work and the selection of the most suitable dataset for the experiments. We selected NUS-WIDE, a large dataset of 260,000 Flickr images, because it is the largest and most widely used benchmark in the field and it also provides ground-truth data to validate the output of the proposed recognition models. We also decided to conduct additional experiments in the video domain, because it is particularly timely and we believe it constitutes a good scenario to validate the robustness of the proposed approach.
The first year of the outgoing phase was then spent designing and developing the core model for automatic image annotation. The key assumption is that images that are difficult to recognize on their own may become clearer in the context of a neighborhood of related images with similar social-network metadata, such as user-generated tags or Flickr groups. To this end the fellow deepened his knowledge of machine learning, and in particular of deep learning, taking advantage of the strong expertise of the outgoing host in this specific field. The proposed model uses image metadata nonparametrically to generate neighborhoods of related images based on semantic similarity, and then uses a deep neural network to blend visual information from the image and its neighbors.
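As a rough illustration of the nonparametric neighborhood step, the following Python sketch selects neighbors by Jaccard similarity over user-tag sets; the feature dimensions, tag lists and helper names are illustrative assumptions, not the exact pipeline used in the project.

```python
import numpy as np

def jaccard(tags_a, tags_b):
    """Metadata similarity between two images based on their user-generated tag sets."""
    a, b = set(tags_a), set(tags_b)
    return len(a & b) / max(len(a | b), 1)

def metadata_neighborhood(query_tags, corpus_tags, k=3):
    """Nonparametrically pick the k corpus images whose metadata best matches the query."""
    scores = np.array([jaccard(query_tags, t) for t in corpus_tags])
    return np.argsort(scores)[::-1][:k]

# Toy corpus: tag sets and precomputed CNN features (e.g. fc7 activations) per image.
corpus_tags = [["dog", "park"], ["dog", "beach"], ["cat", "sofa"],
               ["sunset", "beach"], ["dog", "puppy"], ["car", "street"]]
corpus_feats = np.random.default_rng(0).standard_normal((len(corpus_tags), 4096))

query_tags = ["dog", "puppy", "park"]
neighbors = metadata_neighborhood(query_tags, corpus_tags, k=3)
neighbor_feats = corpus_feats[neighbors]   # visual evidence passed to the blending network
print(neighbors, neighbor_feats.shape)
```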
The second year was then spent on two main tasks. First, the fellow extended the original model by also using additional prior knowledge provided by external taxonomies. A complementary model was also proposed in order to test a different strategy for taking advantage of visual, semantic and social knowledge. Here the idea is to learn a joint embedding (using a CCA-based representation) that projects the different views of the same entity - i.e. the visual features extracted from the image and, for example, its user tags - into a common semantic space. Both models have been evaluated on common benchmarks in order to compare the results with previous work. Second, the fellow developed similar models for sharing knowledge in videos with noisy labels, and reported results on different benchmarks collected from YouTube and on the Stanford campus. Very few prior works have addressed similar video understanding tasks.
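A minimal sketch of such a CCA-based joint embedding is given below, using scikit-learn's CCA on randomly generated stand-ins for visual features and tag embeddings; all dimensions and names are illustrative assumptions, not the published model.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Stand-in data: 200 images, 128-d visual features and 64-d textual features
# (e.g. averaged word vectors of the user tags attached to each image).
rng = np.random.default_rng(0)
visual = rng.standard_normal((200, 128))
textual = rng.standard_normal((200, 64))

# Learn a common space in which the two views of the same image are maximally correlated.
cca = CCA(n_components=16, max_iter=1000)
cca.fit(visual, textual)
visual_emb, textual_emb = cca.transform(visual, textual)

# In this joint space an image and a textual query can be compared directly
# (e.g. with cosine similarity), enabling cross-modal annotation and retrieval.
def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

print(cosine(visual_emb[0], textual_emb[0]))
```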

- Description of the main results achieved so far:

The main results achieved so far can be summarized in three main points.
(1) An in-depth analysis of prior work in the area of image tagging, which contributed to a publication in the prestigious ACM Computing Surveys [CSUR16]. We also organized tutorials at top-tier conferences, such as ACM Multimedia 2015 [ACMM15] and CVPR 2016, to promote our findings and our novel experimental testbed.
(2) Our core image recognition model, which has been presented at the International Conference on Computer Vision 2015 (the premier conference in Computer Vision) [ICCV15]. Here we build on the recent success of deep learning, using CNN features pre-trained on ImageNet, and we design a two-layer network that blends the visual information provided by a neighborhood of images sharing similar semantic content (a minimal sketch is given after this list). This neighborhood is obtained by measuring the similarity of multiple noisy sources of metadata, such as tags and Flickr groups. With this model we obtained state-of-the-art results, by a large margin, on the NUS-WIDE dataset, which is the largest and most challenging benchmark in the field. We also extended the model by refining the label noise using external taxonomies, such as WordNet, and developed an alternative model which directly learns a joint visual-textual embedding space.
(3) Videos are the dark matter of the web: YouTube alone has over a billion users, and every day people watch hundreds of millions of hours of video and generate billions of views. We started to investigate the applicability of the key idea of transferring noisy source data in a cross-domain scenario, for different video recognition tasks. In this case, similarly to the previous image recognition model, we transfer the visual representation extracted from images labeled with the same tag as our test video in order to localize the concept in time (a sketch of this frame-scoring idea is also given after this list). This is the key idea of our data-driven approach for tag refinement and localization in web videos, published in Computer Vision and Image Understanding [CVIU15]. Additionally, we show that the idea of sharing contextual semantic information of the scene can be used effectively to drive visual tasks such as motion understanding and prediction [ECCV16].
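The following PyTorch sketch shows one plausible way to realize the two-layer blending network mentioned in point (2): the image features and the pooled neighbor features are mixed by two fully connected layers that output per-label scores. The architecture, dimensions and class name are illustrative assumptions rather than the exact published model.

```python
import torch
import torch.nn as nn

class NeighborBlendNet(nn.Module):
    """Illustrative two-layer network that blends an image's CNN features with those of
    its metadata neighbors and outputs one score per label (e.g. 81 NUS-WIDE concepts)."""
    def __init__(self, feat_dim=4096, hidden_dim=512, num_labels=81):
        super().__init__()
        self.hidden = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(2 * hidden_dim, num_labels)

    def forward(self, image_feat, neighbor_feats):
        # image_feat: (B, feat_dim); neighbor_feats: (B, K, feat_dim)
        h_img = self.hidden(image_feat)
        h_nbr = self.hidden(neighbor_feats).mean(dim=1)   # pool over the K neighbors
        return self.classifier(torch.cat([h_img, h_nbr], dim=1))

model = NeighborBlendNet()
scores = model(torch.randn(2, 4096), torch.randn(2, 5, 4096))
print(scores.shape)   # torch.Size([2, 81]) -> one score per label
```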
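For the temporal tag localization of point (3), a minimal data-driven sketch is to score every video frame by its similarity to web images carrying the tag and keep the frames above a threshold; all names, feature sizes and the threshold below are illustrative assumptions, not the published method.

```python
import numpy as np

def localize_tag(frame_feats, tagged_image_feats, threshold=0.5):
    """Score each frame by its maximum cosine similarity to images labeled with the tag,
    then return the frame indices where the tag is considered temporally relevant."""
    frames = frame_feats / np.linalg.norm(frame_feats, axis=1, keepdims=True)
    images = tagged_image_feats / np.linalg.norm(tagged_image_feats, axis=1, keepdims=True)
    scores = (frames @ images.T).max(axis=1)          # best-matching tagged image per frame
    return scores, np.flatnonzero(scores > threshold)

rng = np.random.default_rng(0)
frame_feats = rng.standard_normal((120, 512))         # e.g. one CNN feature per video frame
image_feats = rng.standard_normal((40, 512))          # web images sharing the test video's tag
scores, relevant_frames = localize_tag(frame_feats, image_feats)
print(scores.shape, relevant_frames[:10])
```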

References:
[CSUR16] X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. M. Snoek, and A. Del Bimbo, "Socializing the Semantic Gap: A Comparative Survey on Image Tag Assignment, Refinement and Retrieval", ACM Computing Surveys, vol. 49, iss. 1, pp. 14:1-14:39, 2016 (Impact Factor 4.043)
[ECCV16] L. Ballan, F. Castaldo, A. Alahi, F. Palmieri, and S. Savarese, "Knowledge Transfer for Scene-specific Motion Prediction", Proc. of European Conference on Computer Vision (ECCV), Amsterdam, Netherlands, 2016
[ICCV15] J. Johnson*, L. Ballan*, and L. Fei-Fei, "Love Thy Neighbors: Image Annotation by Exploiting Image Metadata", Proc. of IEEE International Conference on Computer Vision (ICCV), Santiago, Chile, 2015 (* equal contribution)
[CVIU15] L. Ballan, M. Bertini, G. Serra, and A. Del Bimbo, "A Data-Driven Approach for Tag Refinement and Localization in Web Videos", Computer Vision and Image Understanding, vol. 140, pp. 58-67, 2015 (Impact Factor 1.540)
[ACMM15] X. Li, T. Uricchio, L. Ballan, M. Bertini, C. G. M. Snoek, and A. Del Bimbo, "Image Tag Assignment, Refinement and Retrieval", Proc. of ACM International Conference on Multimedia (ACM-MM), Brisbane, Australia, 2015 (Tutorial / Short course)

- Expected final results and their potential impact and use (including the socio-economic impact and the wider societal implications of the project so far):

Images and videos want to be shared. Nowadays, several technological developments have spurred the sharing of images in unprecedented volumes. The first is the ease with which visual content can be captured in a digital format by cameras and cellphones. The second is the Web, which allows visual data to be transferred to anyone, anywhere in the world, especially through popular social media. Some recent works have begun to take social network analysis into account when looking at big visual data, but their scope has been too limited. Researchers have presented algorithms for enriching user annotations through tag recommendation strategies (often focusing only on tags and their relations), and methods for estimating tag relevance in order to rank and filter the list of tags. Nevertheless, most of these studies were conducted in simplified settings and do not exploit the full information given by visual media and their social and semantic relations.
Our work goes beyond these studies by introducing a new framework which jointly models objective knowledge, given by the visual appearance; prior knowledge, given by visual and linguistic ontologies; and noisy collective knowledge, given by social tags and social relations. This is a challenging issue since, at the scale of the web, we are looking at connections among thousands of elements. The main outcome of the EAGLE project is a pool of models for sharing knowledge in the specific context of noisy and uncertain data, such as the web scenario. Many of the recent advances in visual recognition tasks, such as image classification and object detection, are mostly driven by deep neural networks and large datasets of manually annotated data, in a fully supervised setup. But this cannot scale to the web, and it can introduce a strong bias towards a specific task or domain. Having a model which is able to take advantage of already available prior knowledge, while also exploiting the noisy data available on the web, can be an effective way to analyse big visual data. This might have a huge impact on several different domains, ranging from user behaviour analysis and prediction on the web, to data recommendation and augmentation, to smart cities (where user interests can be predicted from their visual repositories).

Contact

Alberto Del Bimbo (Full Professor)
Tel.: +39 055 4796262
Fax: +39 055 2751396
E-mail

Subjects

Life Sciences