Performance of ChatGPT-4o, Gemini Advanced-2.0, and DeepSeek-V3 in the diagnosis and management of temporomandibular disorders


Bilgin Avsar D., Ertan A. A.

BMC Oral Health, vol.26, no.1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 26 Issue: 1
  • Publication Date: 2026
  • DOI Number: 10.1186/s12903-025-07415-y
  • Journal Name: BMC Oral Health
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE, Directory of Open Access Journals
  • Keywords: Artificial intelligence, Large language model, Masticatory muscle disorders, Temporomandibular disorders, Temporomandibular joint
  • Hacettepe University Affiliated: Yes

Abstract

Background: This study evaluated the performance of three large language models (LLMs), ChatGPT-4o (OpenAI, San Francisco, CA, USA), Gemini Advanced-2.0 (Google AI, Mountain View, CA, USA), and DeepSeek-V3 (DeepSeek AI, Hangzhou, China), in diagnosing temporomandibular joint and masticatory muscle disorders and recommending their management.

Materials and methods: Thirty clinical scenarios (15 temporomandibular joint-related, 15 masticatory muscle-related) were designed based on the literature and clinical expertise. The cases were submitted to each LLM via its web platform. Responses were rated separately for diagnosis and management on a 3-point Likert scale (0 = incorrect, 1 = partially correct, 2 = correct). Two prosthodontists with 5 and 30 years of clinical experience independently scored each response. Data were statistically analyzed, and differences were considered significant at p < 0.05.

Results: No statistically significant differences were observed in diagnostic scores. However, ChatGPT-4o achieved the highest total scores and performed significantly better in management than Gemini Advanced-2.0 and DeepSeek-V3 (p < 0.05). The evaluators' ratings were consistent in most cases.

Conclusions: Among the evaluated LLMs, ChatGPT-4o demonstrated the highest overall performance, providing more accurate management recommendations than the other models; diagnostic reliability, however, varied across cases. These findings indicate that LLMs may serve as supportive tools in clinical decision-making, particularly for generating preliminary insights. Nevertheless, their current limitations underscore the importance of case-specific professional oversight, and further studies involving real patient cases are needed to assess their practical reliability.
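Note: The abstract does not name the statistical tests or the agreement measure used. The sketch below is a minimal illustration only, assuming hypothetical score data and an assumed choice of methods (a Kruskal-Wallis comparison of the three models' ordinal scores and a weighted Cohen's kappa for inter-rater agreement, both common for 3-point Likert data but not confirmed by the source); it is meant only to make the described scoring and comparison protocol concrete.

    # Illustrative sketch only: hypothetical scores, not the study's data or its
    # reported statistical workflow.
    import numpy as np
    from scipy.stats import kruskal
    from sklearn.metrics import cohen_kappa_score

    rng = np.random.default_rng(0)

    # Hypothetical 3-point Likert management scores (0/1/2) for 30 scenarios per model.
    scores = {
        "ChatGPT-4o": rng.integers(0, 3, 30),
        "Gemini Advanced-2.0": rng.integers(0, 3, 30),
        "DeepSeek-V3": rng.integers(0, 3, 30),
    }

    # Hypothetical ratings from the two evaluators for one model, to gauge agreement.
    rater_1 = rng.integers(0, 3, 30)
    rater_2 = rng.integers(0, 3, 30)

    # Weighted kappa is often used for ordinal scales; the weighting choice is an assumption.
    kappa = cohen_kappa_score(rater_1, rater_2, weights="quadratic")

    # Kruskal-Wallis compares the three independent groups of ordinal scores.
    h_stat, p_value = kruskal(*scores.values())

    print(f"Inter-rater kappa: {kappa:.2f}")
    print(f"Kruskal-Wallis H = {h_stat:.2f}, p = {p_value:.3f}")

Whether these are the tests actually applied in the study cannot be determined from the abstract.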