Artificial intelligence-generated patient information on shoulder instability remains suboptimal: DeepSeek outperforms ChatGPT in completeness of content while ChatGPT is more readable


Öğümsöğütlü E., Bozgeyik B., Huri G.

Knee Surgery, Sports Traumatology, Arthroscopy, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Publication Date: 2026
  • DOI Number: 10.1002/ksa.70335
  • Journal Name: Knee Surgery, Sports Traumatology, Arthroscopy
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, CINAHL, MEDLINE
  • Keywords: artificial intelligence, ChatGPT, DeepSeek, shoulder instability
  • Hacettepe University Affiliated: Yes

Abstract

Purpose: This study aimed to evaluate and compare the performance of the Chat Generative Pre-Trained Transformer (ChatGPT) and DeepSeek artificial intelligence (AI) models as sources of patient information on shoulder instability.

Methods: Sixteen frequently asked questions about shoulder instability were posed to both AI models. Content quality of the responses was evaluated with the Journal of the American Medical Association (JAMA) benchmark criteria, the DISCERN instrument, and a 4-point Likert scale. Readability was analysed with the Flesch Reading Ease Score (FRES) and the Flesch–Kincaid Grade Level (FKGL).

Results: Neither model met the JAMA criteria. On DISCERN, DeepSeek (52.81) scored significantly higher than ChatGPT (48.5) (p = 0.001). In the 4-point Likert evaluation, the models did not differ significantly on accuracy, clarity, or consistency (p > 0.05), but DeepSeek scored significantly higher on completeness (p = 0.001). For readability, ChatGPT averaged an FKGL of 7.78 and an FRES of 52.44, versus an FKGL of 9.90 and an FRES of 41.87 for DeepSeek; both differences were statistically significant (FKGL, p = 0.016; FRES, p = 0.015).

Conclusion: Both AI models provided generally accurate and clinically relevant patient education information on shoulder instability, despite limitations in transparency and source attribution. DeepSeek scored significantly higher on DISCERN and on the Likert completeness criterion, with no significant difference in accuracy, clarity, or consistency, while ChatGPT was more readable. These findings suggest that AI models can serve as patient information tools for shoulder instability, with each model having distinct strengths.

Level of Evidence: Level V.
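For context, the two readability indices reported in the Results are computed from simple text statistics using Flesch's standard published formulas. The sketch below is illustrative only: the count_syllables heuristic is an assumption made for this example, and the study's scores were presumably obtained with a dedicated readability tool rather than this code.

```python
import re

def count_syllables(word: str) -> int:
    """Rough vowel-group heuristic (an assumption for this sketch);
    real readability tools use dictionaries or more careful rules."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1  # treat a final silent 'e' as non-syllabic
    return max(n, 1)

def readability(text: str) -> tuple[float, float]:
    """Return (FRES, FKGL) using the standard Flesch formulas."""
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text) or ["word"]  # avoid division by zero
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences   # mean words per sentence
    spw = syllables / len(words)   # mean syllables per word
    fres = 206.835 - 1.015 * wps - 84.6 * spw  # higher = easier to read
    fkgl = 0.39 * wps + 11.8 * spw - 15.59     # approximate US grade level
    return fres, fkgl
```

On these scales, ChatGPT's reported FRES of 52.44 falls in the "fairly difficult" band and its FKGL of 7.78 corresponds to roughly an 8th-grade reading level, while DeepSeek's FRES of 41.87 ("difficult") and FKGL of 9.90 sit about two grade levels higher, which is the readability gap the Results describe.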