European Archives of Oto-Rhino-Laryngology, 2026 (SCI-Expanded, Scopus)
Purpose: Artificial intelligence (AI) is increasingly integrated into clinical medicine, and large language models (LLMs) show growing potential for diagnostic and therapeutic reasoning. This study compared the diagnostic and therapeutic performance of four LLMs (ChatGPT-5, Gemini 2.5 Pro, Claude Sonnet-4, and DeepSeek-R1) in otolaryngology. Methods: Each model evaluated 100 real patient cases representing multiple otolaryngologic subspecialties. Two board-certified otolaryngologists independently and blindly reviewed and scored all diagnostic and therapeutic outputs on a structured 0–1–2 scale, with consensus scores serving as the gold-standard reference. Results: All models achieved high diagnostic accuracy: 99% for ChatGPT-5, 98% for Gemini 2.5 Pro, 94% for Claude Sonnet-4, and 92% for DeepSeek-R1. Therapeutic accuracy was lower, with Gemini 2.5 Pro achieving the highest rate of correct recommendations (86%), followed by DeepSeek-R1 and Claude Sonnet-4 (69% each) and ChatGPT-5 (64%). Overall differences in therapeutic performance among the four models were statistically significant (p = 0.0001). In pairwise comparisons, Gemini 2.5 Pro showed a statistically significant advantage over the other models. For each model, diagnostic performance was significantly higher than therapeutic performance (p ≤ 0.001 for each comparison). Conclusion: Large language models show great potential as clinical decision-support tools in otolaryngology, with Gemini 2.5 Pro demonstrating the most consistent and accurate therapeutic performance. However, the variability in treatment recommendations among models underscores the need for further refinement and continuous human oversight to ensure their safe and effective integration into clinical practice.
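For readers who want a sense of the sampling uncertainty behind the reported therapeutic accuracy rates, the following is a minimal sketch that computes 95% Wilson score intervals from the figures given in the abstract. It is an illustration only and not part of the study's analysis: the abstract reports no confidence intervals and does not name the statistical tests used. The sketch assumes each percentage corresponds to a count of correct recommendations out of 100 cases; the function name `wilson_interval` and the constants are hypothetical. Because all four models were scored on the same cases, these per-model intervals do not replace the paired comparisons reported in the abstract.

```python
from __future__ import annotations

from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Return the Wilson score interval for a binomial proportion."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Assumption: each reported percentage is a count of correct
# therapeutic recommendations out of the 100 evaluated cases.
N_CASES = 100
THERAPEUTIC_CORRECT = {
    "ChatGPT-5": 64,
    "Gemini 2.5 Pro": 86,
    "Claude Sonnet-4": 69,
    "DeepSeek-R1": 69,
}

if __name__ == "__main__":
    for model, correct in THERAPEUTIC_CORRECT.items():
        low, high = wilson_interval(correct, N_CASES)
        print(f"{model}: {correct}/{N_CASES} correct, 95% Wilson CI {low:.3f} to {high:.3f}")
```

The Wilson interval is used here instead of the simpler normal-approximation interval because it behaves better for proportions near 0 or 1, which matters for the diagnostic rates of 98% and 99% if the same function is applied to them.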