Spherical Vision Transformers for Audio-Visual Saliency Prediction in 360° Videos


Cokelek M., Ozsoy H., Imamoglu N., Ozcinar C., Ayhan İ., Erdem E., et al.

IEEE Transactions on Pattern Analysis and Machine Intelligence, vol.48, no.1, pp.329-345, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 48 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1109/TPAMI.2025.3604091
  • Journal Name: IEEE Transactions on Pattern Analysis and Machine Intelligence
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, ABI/INFORM, Compendex, INSPEC, MEDLINE, zbMATH
  • Page Numbers: pp.329-345
  • Keywords: 360° videos, adapter fine-tuning, audio-visual saliency prediction, vision transformers
  • Hacettepe University Affiliated: Yes

Abstract

Omnidirectional videos (ODVs) are redefining viewer experiences in virtual reality (VR) by offering an unprecedented full field-of-view (FOV). This study extends the domain of saliency prediction to 360° environments, addressing the complexities of spherical distortion and the integration of spatial audio. Contextually, ODVs have transformed user experience by adding a spatial audio dimension that aligns sound direction with the viewer's perspective in spherical scenes. Motivated by the lack of comprehensive datasets for 360° audio-visual saliency prediction, our study curates YT360-EyeTracking, a new dataset of 81 ODVs, each observed under varying audio-visual conditions. Our goal is to explore how to utilize audio-visual cues to effectively predict visual saliency in 360° videos. Towards this aim, we propose two novel saliency prediction models: SalViT360, a vision-transformer-based framework for ODVs equipped with spherical geometry-aware spatio-temporal attention layers, and SalViT360-AV, which further incorporates transformer adapters conditioned on audio input. Our results on a number of benchmark datasets, including our YT360-EyeTracking, demonstrate that SalViT360 and SalViT360-AV significantly outperform existing methods in predicting viewer attention in 360° scenes. Interpreting these results, we suggest that integrating spatial audio cues in the model architecture is crucial for accurate saliency prediction in omnidirectional videos.
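
To make the "transformer adapters conditioned on audio input" concrete, below is a minimal PyTorch sketch of one common realization of that idea: a bottleneck adapter whose hidden activations are gated by an audio embedding. This is not the authors' released code; the class name, dimensions, and FiLM-style gating are illustrative assumptions.

```python
# Minimal sketch (assumed design, not the SalViT360-AV implementation) of an
# audio-conditioned bottleneck adapter for a frozen transformer block.
import torch
import torch.nn as nn

class AudioConditionedAdapter(nn.Module):
    """Bottleneck adapter whose hidden units are scaled by an audio embedding."""

    def __init__(self, dim: int, audio_dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # down-projection
        self.up = nn.Linear(bottleneck, dim)     # up-projection back to token dim
        self.act = nn.GELU()
        # Maps the audio embedding to a per-channel gate (FiLM-style modulation).
        self.audio_gate = nn.Linear(audio_dim, bottleneck)

    def forward(self, x: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim) visual tokens from a frozen ViT block
        # audio: (batch, audio_dim) clip-level audio embedding
        h = self.act(self.down(x))
        h = h * torch.sigmoid(self.audio_gate(audio)).unsqueeze(1)
        # Residual connection: the frozen backbone's features pass through intact.
        return x + self.up(h)

# Usage with illustrative shapes: ViT tokens plus an audio embedding.
tokens = torch.randn(2, 196, 768)
audio = torch.randn(2, 128)
adapter = AudioConditionedAdapter(dim=768, audio_dim=128)
out = adapter(tokens, audio)  # (2, 196, 768)
```

In such a setup only the small adapter is trained while the pretrained backbone stays frozen, which matches the adapter fine-tuning setting listed in the keywords.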