BMC Oral Health, vol.26, no.1, 2026 (SCI-Expanded, Scopus)
Background: This study evaluated the performance of three large language models (LLMs): ChatGPT-4o (OpenAI, San Francisco, CA, USA), Gemini Advanced-2.0 (Google AI, Mountain View, CA, USA), and DeepSeek-V3 (DeepSeek AI, Hangzhou, China), in diagnosing and recommending management for temporomandibular joint and masticatory muscle disorders.

Materials and methods: Thirty clinical scenarios (15 temporomandibular joint-related, 15 masticatory muscle-related) were designed based on the literature and clinical expertise. Each case was submitted to each LLM via its web platform. Responses were rated separately for diagnosis and management on a 3-point Likert scale (0 = incorrect, 1 = partially correct, 2 = correct). Two prosthodontists with 5 and 30 years of clinical experience independently scored each response. Data were analyzed statistically, and differences were considered significant at p < 0.05.

Results: No statistically significant differences were observed in diagnostic scores. However, ChatGPT-4o achieved the highest total scores, with significantly better management performance than Gemini Advanced-2.0 and DeepSeek-V3 (p < 0.05). The evaluators' ratings were consistent in most cases.

Conclusions: Among the evaluated LLMs, ChatGPT-4o demonstrated the highest overall performance, providing more accurate management recommendations than the other models. Diagnostic reliability, however, varied across cases. These findings indicate that LLMs may serve as supportive tools in clinical decision-making, particularly for generating preliminary insights. Nevertheless, their current limitations underscore the need for case-specific professional oversight, and further studies involving real patient cases are essential to better assess their practical reliability.