Comparative evaluation of ChatGPT-4o and DeepSeek-V3 in head and neck oncology

Tellioğlu, Burçay; Pamuk, Ahmet; Külekçi, Muhammed; Pamuk, Gözde; Sütay Süslü, Nilda; Kuşcu, Oğuz

doi:10.1080/00016489.2025.2563035

Tellioğlu B., Pamuk A. E., Külekçi M. Ç., Pamuk G., Sütay Süslü N., Kuşcu O.

ACTA OTO-LARYNGOLOGICA, vol.1, no.1, pp.1-8, 2025 (SCI-Expanded, Scopus)

Publication Type: Article / Full Article
Volume: 1 Issue: 1
Publication Date: 2025
DOI: 10.1080/00016489.2025.2563035
Journal Name: ACTA OTO-LARYNGOLOGICA
Indexes Covering the Journal: Scopus, Science Citation Index Expanded (SCI-EXPANDED), Academic Search Premier, International Bibliography of Social Sciences, Biotechnology Research Abstracts, CINAHL, EMBASE, CAB Abstracts, Linguistics & Language Behavior Abstracts, Veterinary Science Database
Pages: pp.1-8
Affiliated with Hacettepe University: Yes

Abstract

Background: Large language models (LLMs) are increasingly used in clinical decision-making and patient education, including in complex specialties such as head and neck cancer (HNC). Objective: To evaluate the performance of ChatGPT-4o and DeepSeek-V3 in answering HNC-related clinical questions. Methods: A set of 154 questions across six clinical categories was submitted twice to both models. Responses were independently graded by head and neck surgeons using a four-point accuracy scale. Accuracy, reproducibility, and inter-model agreement were assessed. Results: ChatGPT-4o and DeepSeek-V3 provided "comprehensive/correct" answers in 92.2% and 89.6% of cases, respectively (p = .42). The accuracy ratings of both models' responses overlapped in 85.1% of cases; however, the statistical agreement between them remained low (Cohen's κ = 0.12; ICC = 0.21, p = .006). DeepSeek-V3 outperformed ChatGPT in the Treatment category (96.3% vs. 81.5%, p = .08), while ChatGPT excelled in Recovery, Complications, and Follow-up (95.0% vs. 82.5%, p = .08); however, these differences did not reach statistical significance. Reproducibility was high for both models (ChatGPT-4o: 96.1%; DeepSeek-V3: 96.8%). Conclusions: Both models demonstrated strong accuracy and consistency in HNC-related queries. Significance: LLMs hold promise as reliable tools in clinical decision-making and patient education within HNC when used with careful consideration of their inherent limitations.
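
The abstract reports high raw overlap between the two models' ratings (85.1%) alongside a low Cohen's κ (0.12). The sketch below illustrates how that combination can arise when almost all answers fall into one category. It is not the study's data or analysis: the ratings are collapsed to two categories (the study used a four-point scale), the marginal counts 142 and 138 are simply 92.2% and 89.6% of 154, and the split of those counts across the cells is a hypothetical choice made only for illustration.

```python
def cohen_kappa(table):
    """Return (observed agreement, chance agreement, kappa) for a square
    contingency table where table[i][j] counts items rated i by rater A
    and j by rater B."""
    n = sum(sum(row) for row in table)
    k = len(table)
    # Observed agreement: share of items on the diagonal.
    p_o = sum(table[i][i] for i in range(k)) / n
    # Marginal proportions for each rater.
    row_share = [sum(row) / n for row in table]
    col_share = [sum(table[i][j] for i in range(k)) / n for j in range(k)]
    # Agreement expected by chance if the raters were independent.
    p_e = sum(r * c for r, c in zip(row_share, col_share))
    return p_o, p_e, (p_o - p_e) / (1 - p_e)


# HYPOTHETICAL 2x2 table (not taken from the paper).
# Rows: model A rated "correct" / "other"; columns: model B likewise.
# Row total 142 and column total 138 match 92.2% and 89.6% of 154;
# the individual cell values are assumed.
hypothetical_table = [
    [129, 13],
    [9, 3],
]

p_o, p_e, kappa = cohen_kappa(hypothetical_table)
print(f"observed agreement = {p_o:.3f}")
print(f"chance agreement   = {p_e:.3f}")
print(f"Cohen's kappa      = {kappa:.3f}")
```

With these assumed counts the observed agreement is about 0.86, but chance agreement is already about 0.83 because both raters assign the top category so often, which leaves κ at roughly 0.14. Kappa measures agreement beyond chance, so a skewed category distribution can pull it down even when raw overlap is high.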