How well do vision models understand tasks with multiple labels?

Can Bilge, Yunus

doi:10.1016/j.eswa.2026.131479

Expert Systems with Applications, vol. 314, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 314
Publication Date: 2026
DOI: 10.1016/j.eswa.2026.131479
Journal Name: Expert Systems with Applications
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Public Affairs Index
Keywords: Image attribute recognition, Multi-label learning, Transfer learning
Hacettepe University Affiliated: Yes

Abstract

The increasing availability of pre-trained vision backbones has significantly advanced multi-label image classification, yet the comparative transferability and generalization behavior of these models across diverse target domains remain underexplored. In this study, we present a comprehensive empirical analysis of 80 pre-trained backbones, evaluated in a consistent setting across five benchmark datasets: MS-COCO, NUS-WIDE, CelebA, PA-100K, and MS-COCO-2012. While the architectures and benchmarks used in our study are established, our work provides the first large-scale, standardized analysis of backbone transferability in multi-label settings, offering practical insights and reproducible tools that are currently lacking in the literature and that remain highly relevant for real-world deployment and benchmarking. Using a standardized multi-label image classification framework and seven evaluation metrics, we systematically assess the performance, robustness, and efficiency of each model. We investigate the influence of object scale, dataset diversity and size, and classifier depth, as well as the relationships among the evaluation metrics and how closely they align. We observe that accuracy and recall metrics are strongly aligned, while instance-level precision behaves more independently, suggesting the need for a composite measure for backbone selection. To address this need, we introduce TAME and TAMEeff, composite scoring strategies that account for both predictive performance and model efficiency. Together with the efficiency analysis, these composite metrics offer actionable guidance for backbone selection in real-world and resource-constrained multi-label applications. All model outputs, evaluation scripts, and diagnostics will be made publicly available to support reproducibility and further research.