SIGNAL IMAGE AND VIDEO PROCESSING, cilt.12, sa.6, ss.1197-1205, 2018 (SCI-Expanded)
In activity recognition, usage of depth data is a rapidly growing research area. This paper presents a method for recognizing single-person activities and dyadic interactions by using deep features extracted from both 3D and 2D representations, which are constructed from depth sequences. First, a 3D volume representation is generated by considering spatiotemporal information in depth frames of an action sequence. Then, a 3D-CNN is trained to learn features from these 3D volume representations. In addition to this, a 2D representation is constructed from the weighted sum of the depth sequences. This 2D representation is used with a pre-trained CNN model. Features learned from this model and the 3D-CNN model are used in training of the final approach after a feature selection step. Among the various classifiers, an SVM-based model produced the best results. The proposed method was tested on the MSR-Action3D dataset for single-person activities, the SBU dataset for dyadic interactions, and the NTU RGB+D dataset for both types of actions. Experimental results show that proposed 3D and 2D representations and deep features extracted from them are robust and efficient. The proposed method achieves comparable results with the state of the art methods in the literature.