Recognize Complex Events From Static Images by Fusing Deep Channels
Yuanjun Xiong, Kai Zhu, Dahua Lin, Xiaoou Tang; Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015, pp. 1600-1609
Abstract
A considerable portion of web images capture events that occur in our personal lives or social activities. In this paper, we aim to develop an effective method for recognizing events from such images. Despite the large body of work on event recognition, most existing methods rely on videos and are not directly applicable to this task. Generally, events are complex phenomena that involve interactions among people and objects, so analyzing event photos requires techniques that go beyond recognizing individual objects and carry out joint reasoning based on evidence from multiple aspects. Inspired by the recent success of deep learning, we formulate a multi-layer framework to tackle this problem, which takes into account both visual appearance and the interactions among humans and objects, and combines them via semantic fusion. An important issue arising here is that humans and objects discovered by detectors are in the form of bounding boxes, and there is no straightforward way to represent their interactions and incorporate them into a deep network. We address this using a novel strategy that projects the detected instances onto multi-scale spatial maps. On a large dataset with 60,000 images, the proposed method achieved substantial improvement over the state of the art, raising the accuracy of event recognition by over 10%.
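The idea of projecting detected bounding boxes onto multi-scale spatial maps can be illustrated with a minimal sketch. The abstract does not give the exact construction, so everything here is an assumption made for illustration: the function name boxes_to_maps, the grid sizes, the one-channel-per-class layout, and the choice to fill each covered cell with the maximum detection score. It is not the authors' implementation.

```python
import numpy as np

def boxes_to_maps(boxes, labels, scores, num_classes, image_size,
                  grid_sizes=(4, 8, 16)):
    """Rasterize detections onto one (num_classes, g, g) map per grid size.

    Hypothetical helper: boxes are (x0, y0, x1, y1) in pixels, labels are
    integer class indices, image_size is (width, height).
    """
    width, height = image_size
    maps = []
    for g in grid_sizes:
        m = np.zeros((num_classes, g, g), dtype=np.float32)
        for (x0, y0, x1, y1), c, s in zip(boxes, labels, scores):
            # Map pixel coordinates to grid cells, clamped to the grid.
            c0 = min(max(int(np.floor(x0 / width * g)), 0), g - 1)
            r0 = min(max(int(np.floor(y0 / height * g)), 0), g - 1)
            # Every box covers at least one cell, even when it is tiny.
            c1 = min(max(int(np.ceil(x1 / width * g)), c0 + 1), g)
            r1 = min(max(int(np.ceil(y1 / height * g)), r0 + 1), g)
            # Keep the strongest score per cell (assumed aggregation rule).
            region = m[c, r0:r1, c0:c1]
            np.maximum(region, s, out=region)
        maps.append(m)
    return maps
```

Under these assumptions, each map has a fixed shape regardless of how many detections an image contains, which is what would let such maps be stacked as input channels to a convolutional network alongside the appearance features.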
Related Material
[pdf]
[bibtex]
@InProceedings{Xiong_2015_CVPR,
author = {Xiong, Yuanjun and Zhu, Kai and Lin, Dahua and Tang, Xiaoou},
title = {Recognize Complex Events From Static Images by Fusing Deep Channels},
booktitle = {Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR)},
month = {June},
year = {2015},
pages = {1600-1609}
}