Multi-criteria evaluation of clinical decision-making performance in spinal neurosurgery and physical therapy scenarios: A comparative analysis of artificial intelligence models


Creative Commons License

Tuncer C., TEKİN R. T., Uludağ V., Kılıç G., Taşkesen A.

European Spine Journal, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Publication Date: 2026
  • DOI Number: 10.1007/s00586-026-09795-3
  • Journal Name: European Spine Journal
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, EMBASE, MEDLINE
  • Keywords: Artificial intelligence, Clinical decision support, GPT-4, Physiotherapy, Spinal neurosurgery
  • Hacettepe University Affiliated: Yes

Abstract

Background: The integration of AI into healthcare, particularly clinical decision-making, has shown promising results. This study evaluates the performance of two AI models, GPT-4 and GPT-3.5, in spinal neurosurgery and physiotherapy, areas that require precise and dynamic decision-making.

Methods: We conducted a prospective, observational study with 64 participants, including neurosurgeons and physiotherapists, who evaluated AI-generated responses to 10 detailed clinical scenarios. The assessment criteria were diagnostic accuracy, treatment suitability, surgical technique detail, and rehabilitation planning. Each scenario was crafted to reflect a common yet complex clinical situation.

Results: GPT-4 consistently outperformed GPT-3.5 on all evaluated criteria, with the largest differences observed in treatment suitability and rehabilitation planning. Statistical analyses, including paired t tests and ANOVA, confirmed the superiority of GPT-4, highlighting its advanced language processing capabilities and broader medical knowledge base. Reliability analyses further supported these findings. Cronbach’s alpha values indicated moderate internal consistency for GPT-4 (α = 0.344) and lower consistency for GPT-3.5 (α = 0.133). Additionally, Cohen’s kappa values demonstrated moderate agreement for GPT-4 (κ = 0.65) and fair agreement for GPT-3.5 (κ = 0.48), further validating the reliability of the participants’ evaluations.

Conclusions: GPT-4 has significant potential as a clinical decision support tool, especially in complex and multidisciplinary fields such as spinal neurosurgery and physiotherapy, but its recommendations should be carefully integrated with clinical expertise. Further research is essential to improve its application and to ensure that AI can effectively support dynamic medical environments.
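For readers unfamiliar with the two reliability statistics reported in the abstract, the following is a minimal sketch of how Cronbach’s alpha and Cohen’s kappa are conventionally computed. It is not the authors’ analysis code. The function names and the toy ratings are hypothetical, and the kappa function covers only the two-rater case; the abstract does not state how agreement was aggregated across the 64 participants.

```python
from collections import Counter
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of ratings.

    scores: list of rows, one row per rater and one column per criterion
    (layout assumed here for illustration only).
    """
    k = len(scores[0])
    if k < 2 or len(scores) < 2:
        raise ValueError("need at least two criteria and two raters")
    # Sample variance of each criterion (column) across raters.
    item_vars = [variance(col) for col in zip(*scores)]
    # Sample variance of each rater's summed score.
    # Undefined (division by zero) if all summed scores are identical.
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters assigning categorical labels."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item set")
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    # Undefined (division by zero) if expected agreement equals 1.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / n**2
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up ratings: 4 raters scoring 3 criteria on a 1-5 scale.
    toy_scores = [[4, 5, 4], [3, 4, 4], [5, 5, 4], [2, 3, 3]]
    print("alpha:", round(cronbach_alpha(toy_scores), 3))
    # Made-up labels from two raters over six responses.
    a = ["good", "good", "poor", "good", "poor", "poor"]
    b = ["good", "poor", "poor", "good", "poor", "good"]
    print("kappa:", round(cohen_kappa(a, b), 3))
```

Alpha approaches 1 when the criteria vary together across raters, and kappa is 0 when two raters agree no more often than chance would predict.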