Multi-criteria evaluation of clinical decision-making performance in spinal neurosurgery and physical therapy scenarios: A comparative analysis of artificial intelligence models

Tuncer, Cengiz; TEKİN, RABİA; Uludağ, Veysel; Kılıç, Güven; Taşkesen, Ahmet

doi:10.1007/s00586-026-09795-3

Tuncer C., TEKİN R. T., Uludağ V., Kılıç G., Taşkesen A.

European Spine Journal, vol. 35, no. 3, pp. 1101-1108, 2026 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 35, Issue: 3
Publication Date: 2026
DOI: 10.1007/s00586-026-09795-3
Journal Name: European Spine Journal
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE
Pages: pp. 1101-1108
Keywords: Artificial intelligence, Clinical decision support, GPT-4, Physiotherapy, Spinal neurosurgery
Hacettepe University Affiliated: Yes

Abstract

Background: The integration of AI in healthcare, particularly in clinical decision-making, has shown promising results. This study evaluates the performance of two AI models, GPT-4 and GPT-3.5, in spinal neurosurgery and physiotherapy, areas that require precise and dynamic decision-making.

Methods: We conducted a prospective, observational study with 64 participants, including neurosurgeons and physiotherapists, who evaluated AI-generated responses to 10 detailed clinical scenarios. The assessment criteria were diagnostic accuracy, treatment suitability, surgical technique detail, and rehabilitation planning. Each scenario was crafted to reflect common yet complex clinical situations.

Results: GPT-4 consistently outperformed GPT-3.5 across all evaluated criteria, with the most pronounced differences observed in treatment suitability and rehabilitation planning. Statistical analyses, including paired t-tests and ANOVA, confirmed the superiority of GPT-4, highlighting its advanced language processing capabilities and broader medical knowledge base. Reliability analyses further supported these findings. Cronbach’s alpha values indicated moderate internal consistency for GPT-4 (α = 0.344) and lower consistency for GPT-3.5 (α = 0.133). Additionally, Cohen’s kappa values demonstrated moderate agreement for GPT-4 (κ = 0.65) and fair agreement for GPT-3.5 (κ = 0.48), further validating the reliability of the participants’ evaluations.

Conclusions: While GPT-4 has significant potential as a clinical decision support tool, especially in complex and multidisciplinary fields such as spinal neurosurgery and physiotherapy, its recommendations should be carefully integrated with clinical expertise. Further research is essential to enhance its application and ensure that AI can effectively support dynamic medical environments.
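
The abstract reports Cronbach’s alpha and Cohen’s kappa but does not say how they were computed. As a rough illustration only, the textbook definitions of both statistics can be sketched in Python as below. The function names and the toy ratings are hypothetical and are not taken from the study.

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of ratings.

    scores: list of rows, one row per rater, one column per item
    (assumed layout; the study's actual data layout is not given).
    """
    k = len(scores[0])                                   # number of items
    item_vars = [variance(col) for col in zip(*scores)]  # sample variance per item
    total_var = variance([sum(row) for row in scores])   # variance of row totals
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per label.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical toy data, for illustration only.
alpha = cronbach_alpha([[1, 1], [2, 2], [3, 3]])
kappa = cohen_kappa([1, 1, 0, 0], [1, 0, 0, 0])
```

Both functions use only the standard library. A real analysis would more likely rely on a dedicated statistics package and would need to handle missing ratings and more than two raters.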