Artificial intelligence-generated patient information on shoulder instability remains suboptimal: DeepSeek outperforms ChatGPT in completeness of content while ChatGPT is more readable


Öğümsöğütlü E., Bozgeyik B., Huri G.

Knee Surgery, Sports Traumatology, Arthroscopy, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Publication Date: 2026
  • DOI Number: 10.1002/ksa.70335
  • Journal Name: Knee Surgery, Sports Traumatology, Arthroscopy
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE
  • Keywords: artificial intelligence, ChatGPT, DeepSeek, shoulder instability
  • Hacettepe University Affiliated: Yes

Abstract

Purpose: This study aimed to evaluate and compare the performance of the Chat Generative Pre-Trained Transformer (ChatGPT) and DeepSeek artificial intelligence (AI) models as sources of patient information on shoulder instability.

Methods: Sixteen frequently asked questions about shoulder instability were posed to both AI models. Content quality of the responses was evaluated with the Journal of the American Medical Association (JAMA) benchmark criteria, the DISCERN instrument, and a 4-point Likert scale. Readability was analysed with the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL).

Results: Neither model met the JAMA criteria. On DISCERN, DeepSeek (52.81) scored significantly higher than ChatGPT (48.5) (p = 0.001). In the 4-point Likert evaluation, the models did not differ significantly on accuracy, clarity, or consistency (p > 0.05), but DeepSeek scored significantly higher on completeness (p = 0.001). For readability, ChatGPT averaged an FKGL of 7.78 and an FRES of 52.44, versus an FKGL of 9.90 and an FRES of 41.87 for DeepSeek; both differences were statistically significant (FKGL, p = 0.016; FRES, p = 0.015).

Conclusion: Both AI models provided generally accurate and clinically relevant patient education information on shoulder instability, despite limitations in transparency and source attribution. DeepSeek scored significantly higher on DISCERN and on the Likert completeness criterion, with no significant difference in accuracy, clarity, or consistency, while ChatGPT was more readable. These findings suggest that AI models can serve as patient information tools for shoulder instability, with each model having distinct strengths.

Level of Evidence: Level V.
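For context, the two readability indices reported in the Results are computed from simple text statistics using Flesch's standard published formulas. The sketch below is illustrative only: the count_syllables heuristic is an assumption made for this example, and the study's scores were presumably obtained with a dedicated readability tool rather than this code.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an assumption for this sketch);
    real readability tools use dictionaries or more careful rules."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # treat a final silent 'e' as non-syllabic
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) using the standard Flesch formulas."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text) or ["word"]  # avoid division by zero
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # higher = easier to read
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # approximate US grade level
    return fres, fkgl
```

On these scales, ChatGPT's reported FRES of 52.44 falls in the "fairly difficult" band and its FKGL of 7.78 corresponds to roughly an 8th-grade reading level, while DeepSeek's FRES of 41.87 ("difficult") and FKGL of 9.90 sit about two grade levels higher, which is the readability gap the Results describe.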