IEEE TRANSACTIONS ON MULTIMEDIA, vol.14, no.4, pp.1031-1045, 2012 (SCI-Expanded)
Action recognition in uncontrolled videos is challenging because it is hard to collect enough training videos to model all the variations of the domain. This paper addresses this challenge and proposes a generic method for action recognition. The idea is to use images collected from the Web to learn representations of actions and to leverage this knowledge to automatically annotate actions in videos. For this purpose, we first use an incremental image retrieval procedure to collect and clean up the training set needed to build the human pose classifiers. Our approach is unsupervised in the sense that it requires no human intervention other than text queries to an Internet search engine. Its benefits are two-fold: 1) we can improve the retrieval of action images, and 2) we can collect a large generic database of action poses, which can then be used to tag videos. We present experimental evidence that actions in videos can be annotated using action images collected from the Web. Additionally, we explore how the Web-based pose classifiers can be used in conjunction with a limited number of labelled videos. We propose "ordered pose pairs" (OPP) for encoding the temporal ordering of poses in our action model, and show that considering the temporal ordering of pose pairs can increase action recognition accuracy. We also show that selecting the keyposes with the help of Web-based classifiers can reduce classification time. Our experiments demonstrate that, with or without available video data, the pose models learned from the Web can improve the performance of action recognition systems.
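The abstract does not spell out how ordered pose pairs are computed, so the following is only an illustrative sketch of the general idea, not the paper's formulation. It assumes that each video frame has already been assigned a discrete pose label by some pose classifier, and that an ordered pair is simply (earlier pose, later pose). The function names, the `max_gap` parameter, and the L1 normalisation are all hypothetical choices made for this example.

```python
from collections import Counter
from itertools import combinations


def ordered_pose_pairs(pose_labels, max_gap=None):
    """Count ordered pairs (earlier_pose, later_pose) in a per-frame label sequence.

    pose_labels: per-frame pose labels (assumed output of a pose classifier).
    max_gap: optional limit on the frame distance between the two poses
             (a hypothetical parameter, not taken from the paper).
    """
    counts = Counter()
    # combinations() preserves input order, so i < j always holds here.
    for (i, a), (j, b) in combinations(enumerate(pose_labels), 2):
        if max_gap is None or j - i <= max_gap:
            counts[(a, b)] += 1
    return counts


def opp_histogram(pose_labels, vocabulary, max_gap=None):
    """Flatten pair counts into an L1-normalised vector over vocabulary x vocabulary."""
    counts = ordered_pose_pairs(pose_labels, max_gap)
    total = sum(counts.values())
    return [
        counts[(a, b)] / total if total else 0.0
        for a in vocabulary
        for b in vocabulary
    ]
```

Under these assumptions, the sequences `["stand", "crouch", "jump"]` and `["jump", "crouch", "stand"]` contain the same set of poses but yield different pair counts, which is the kind of temporal information an unordered bag of poses would lose.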