Leveraging semantic saliency maps for query-specific video summarization

Cizmeciler, Kemal; Erdem, MEHMET; Erdem, İBRAHİM

doi:10.1007/s11042-022-12442-w

Cizmeciler K., Erdem E., Erdem A.

MULTIMEDIA TOOLS AND APPLICATIONS, vol.81, no.12, pp.17457-17482, 2022 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 81 Issue: 12
Publication Date: 2022
DOI Number: 10.1007/s11042-022-12442-w
Journal Name: MULTIMEDIA TOOLS AND APPLICATIONS
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, FRANCIS, ABI/INFORM, Applied Science & Technology Source, Compendex, Computer & Applied Sciences, INSPEC, zbMATH
Page Numbers: pp.17457-17482
Keywords: Query-specific, Video summarization, EGOCENTRIC VIDEO, SCENE
Hacettepe University Affiliated: Yes

Abstract

The immense number of videos being uploaded to video sharing platforms makes it impossible for a person to watch all of them and understand what happens in them. Hence, machine learning techniques are now deployed to index videos by recognizing key objects, actions and scenes or places. Summarization is an alternative, as it extracts only the important parts while still covering the gist of the video content. Ideally, the user may prefer to analyze a certain action or scene by searching for a query term within the video. Current summarization methods generally do not take queries into account or require exhaustive data labeling. In this work, we present a weakly supervised query-focused video summarization method. Our approach uses semantic attributes as an indicator of query relevance and semantic attention maps to locate the related regions in the frames, and combines both within a submodular maximization framework. We conducted experiments on the recently introduced RAD dataset and obtained highly competitive results. Moreover, to better evaluate the performance of our approach on longer videos, we collected a new dataset consisting of 10 YouTube videos annotated with multiple shot-level attributes. Our dataset enables a much more diverse set of queries that can be used to summarize a video from different perspectives with more degrees of freedom.
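
The abstract does not spell out the objective, so the following is only a generic sketch of how shot selection by submodular maximization can look, not the authors' formulation. It assumes each shot already has a query-relevance score and that pairwise shot similarities are non-negative; the facility-location coverage term, the weight `lam`, and all function names are hypothetical choices made for illustration.

```python
from typing import List, Sequence


def facility_location(selected: Sequence[int],
                      sim: Sequence[Sequence[float]]) -> float:
    """Coverage term: every shot is credited with its best match in the summary."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def objective(selected: Sequence[int],
              relevance: Sequence[float],
              sim: Sequence[Sequence[float]],
              lam: float) -> float:
    """Assumed objective: summed query relevance plus weighted coverage."""
    return sum(relevance[j] for j in selected) + lam * facility_location(selected, sim)


def greedy_summary(relevance: Sequence[float],
                   sim: Sequence[Sequence[float]],
                   budget: int,
                   lam: float = 1.0) -> List[int]:
    """Greedily add the shot with the largest marginal gain until the budget is met."""
    selected: List[int] = []
    remaining = set(range(len(relevance)))
    while remaining and len(selected) < budget:
        base = objective(selected, relevance, sim, lam)
        # Iterate in index order so ties resolve to the earliest shot.
        best = max(
            sorted(remaining),
            key=lambda j: objective(selected + [j], relevance, sim, lam) - base,
        )
        gain = objective(selected + [best], relevance, sim, lam) - base
        if gain <= 0:
            break  # no remaining shot improves the summary
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    # Toy input (made up): shots 0-1 look alike, shots 2-3 look alike.
    relevance = [0.9, 0.8, 0.1, 0.2]
    sim = [
        [1.0, 0.9, 0.0, 0.0],
        [0.9, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.8],
        [0.0, 0.0, 0.8, 1.0],
    ]
    print(greedy_summary(relevance, sim, budget=2))
```

In this toy input the second pick goes to a shot from the other visual cluster even though a more query-relevant shot is still available, because the coverage term rewards diversity. For monotone submodular objectives under a cardinality budget, this greedy scheme is the standard choice since it carries a (1 - 1/e) approximation guarantee.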