Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos


Cokelek M., Ozsoy H., Imamoglu N., Ozcinar C., Ayhan İ., Erdem E., et al.

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.48, no.1, pp.329-345, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 48 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1109/TPAMI.2025.3604091
  • Journal Name: IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, ABI/INFORM, Compendex, INSPEC, MEDLINE, zbMATH
  • Page Numbers: pp.329-345
  • Keywords: 360° videos, adapter fine-tuning, audio-visual saliency prediction, vision transformers
  • Hacettepe University Affiliated: Yes

Abstract

Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos.
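
To make the "transformer adapters conditioned on audio input" concrete, below is a minimal PyTorch sketch of one common realization of that idea: a bottleneck adapter whose hidden activations are gated by an audio embedding. This is not the authors' released code; the class name, dimensions, and FiLM-style gating are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the SalViT360-AV implementation) of an
# audio-conditioned bottleneck adapter for a frozen transformer block.
import torch
import torch.nn as nn

class AudioConditionedAdapter(nn.Module):
    """Bottleneck adapter whose hidden units are scaled by an audio embedding."""

    def __init__(self, dim: int, audio_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection
        self.up = nn.Linear(bottleneck, dim)     # up-projection back to token dim
        self.act = nn.GELU()
        # Maps the audio embedding to a per-channel gate (FiLM-style modulation).
        self.audio_gate = nn.Linear(audio_dim, bottleneck)

    def forward(self, x: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) visual tokens from a frozen ViT block
        # audio: (batch, audio_dim) clip-level audio embedding
        h = self.act(self.down(x))
        h = h * torch.sigmoid(self.audio_gate(audio)).unsqueeze(1)
        # Residual connection: the frozen backbone's features pass through intact.
        return x + self.up(h)

# Usage with illustrative shapes: ViT tokens plus an audio embedding.
tokens = torch.randn(2, 196, 768)
audio = torch.randn(2, 128)
adapter = AudioConditionedAdapter(dim=768, audio_dim=128)
out = adapter(tokens, audio)  # (2, 196, 768)
```

In such a setup only the small adapter is trained while the pretrained backbone stays frozen, which matches the adapter fine-tuning setting listed in the keywords.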