ACTA OTO-LARYNGOLOGICA, vol.1, no.1, pp.1-8, 2025 (SCI-Expanded, Scopus)
Background: Large language models (LLMs) are increasingly used in clinical decision-making and patient education, including in complex specialties such as head and neck cancer (HNC). Objective: To evaluate the performance of ChatGPT-4o and DeepSeek-V3 in answering HNC-related clinical questions. Methods: A set of 154 questions across six clinical categories was submitted twice to both models. Responses were independently graded by head and neck surgeons using a four-point accuracy scale. Accuracy, reproducibility, and inter-model agreement were assessed. Results: ChatGPT-4o and DeepSeek-V3 provided "comprehensive/correct" answers in 92.2% and 89.6% of cases, respectively (p = .42). The accuracy ratings of the two models' responses overlapped in 85.1% of cases; however, the statistical agreement between them remained low (Cohen's κ = 0.12; ICC = 0.21, p = .006). DeepSeek-V3 outperformed ChatGPT-4o in the Treatment category (96.3% vs. 81.5%, p = .08), whereas ChatGPT-4o excelled in Recovery, Complications, and Follow-up (95.0% vs. 82.5%, p = .08); neither difference reached statistical significance. Reproducibility was high for both models (ChatGPT-4o: 96.1%; DeepSeek-V3: 96.8%). Conclusions: Both models demonstrated strong accuracy and consistency in HNC-related queries. Significance: LLMs hold promise as reliable tools in clinical decision-making and patient education in HNC when used with careful consideration of their inherent limitations.
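The inter-model agreement statistic reported above, Cohen's κ, corrects raw agreement for agreement expected by chance, which is why two models can overlap on 85.1% of ratings yet show κ = 0.12. As a minimal sketch (the study's actual graded ratings are not available here; the rating lists below are invented for illustration), κ on paired four-point ratings could be computed as:

```python
from collections import Counter

def cohen_kappa(ratings_a, ratings_b):
    """Cohen's kappa for two paired lists of categorical ratings."""
    n = len(ratings_a)
    # Observed proportion of exact agreement between the two raters/models.
    p_observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Expected agreement from the marginal frequency of each category.
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    categories = set(counts_a) | set(counts_b)
    p_expected = sum(counts_a[c] * counts_b[c] for c in categories) / (n * n)
    return (p_observed - p_expected) / (1 - p_expected)

# Hypothetical four-point accuracy grades (1 = comprehensive/correct ... 4 = incorrect)
model_a = [1, 1, 1, 2, 1, 3, 1, 1]
model_b = [1, 1, 1, 1, 1, 1, 2, 1]
print(round(cohen_kappa(model_a, model_b), 3))
```

High raw overlap with a skewed rating distribution (most answers graded "comprehensive/correct") drives the expected-agreement term up, so κ can be near zero even when the models usually coincide, matching the pattern reported in the Results.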