SIGNAL IMAGE AND VIDEO PROCESSING, vol. 16, no. 4, pp. 865-872, 2022 (SCI-Expanded)
In this study, we use attention mechanisms to leverage the spatio-temporal information available in videos for the action recognition and collective activity recognition tasks. In this context, we explore 2D and 3D attention mechanisms and investigate their effect on capturing action-relevant information. To this end, we introduce a framework for incorporating 2D and 3D attention into two distinct 3D-ConvNet architectures: standard 3D-ConvNets (C3D) and inflated 3D-ConvNets (I3D). We evaluate this framework on four benchmark datasets: UCF101 and HMDB51 for action recognition, and CAD and C-Sports for collective activity recognition. Experimental results show that the 3D attention-based ConvNets improve performance on all datasets compared to the architectures that do not use any attention mechanism. Our results also indicate that the 3D attention mechanism yields higher recognition performance than its 2D counterpart.