GastroGPT: Development and controlled testing of a proof-of-concept customized clinical language model


Simsek C., Üçdal M. T., De-Madaria E., Ebigbo A., Vanek P., Elshaarawy O., et al.

ENDOSCOPY INTERNATIONAL OPEN, vol.7, no.1, pp.1-10, 2025 (ESCI)

  • Publication Type: Article / Full Article
  • Volume: 7 Issue: 1
  • Publication Date: 2025
  • DOI: 10.1055/a-2637-2163
  • Journal Name: ENDOSCOPY INTERNATIONAL OPEN
  • Journal Indexes: Emerging Sources Citation Index (ESCI), EMBASE, Directory of Open Access Journals
  • Page Numbers: pp.1-10
  • Hacettepe University Affiliated: Yes

Abstract

Background and aims: Current general-purpose artificial intelligence (AI) large language models (LLMs) demonstrate limited efficacy in clinical medicine, often constrained to question-answering, documentation, and literature summarisation roles. We developed GastroGPT, a proof-of-concept specialty-specific, multi-task clinical LLM, and evaluated its performance against leading general-purpose LLMs across key gastroenterology tasks and diverse case scenarios.

Methods: In this structured analysis, GastroGPT was compared with three state-of-the-art general-purpose LLMs (LLM-A: GPT-4, LLM-B: Bard, LLM-C: Claude). Models were assessed on seven clinical tasks and on overall performance across ten simulated gastroenterology cases varying in complexity, frequency, and patient demographics. Standardised prompts facilitated structured comparisons. A blinded expert panel rated model outputs per task on a 10-point Likert scale, judging clinical utility. Comprehensive statistical analyses were conducted.

Results: A total of 2,240 expert ratings were obtained. GastroGPT achieved a significantly higher mean overall score (8.1±1.8) than GPT-4 (5.2±3.0), Bard (5.7±3.3), and Claude (7.0±2.7) (all p<0.001). It outperformed the comparators in 6 of 7 tasks (p<0.05), the exception being follow-up planning. GastroGPT demonstrated superior score consistency (variance 34.95) versus the general-purpose models (97.4-260.35) (p<0.001). Its performance remained consistent across case complexities and frequencies, unlike the comparators (p<0.001). Multivariate analysis revealed that model type significantly predicted performance (p<0.001).

Conclusion: This study pioneers the development and comparison of a specialty-specific, clinically oriented AI model against general-purpose LLMs. GastroGPT demonstrated superior utility overall and on key gastroenterology tasks, highlighting the potential of tailored, task-focused AI models in medicine.
Keywords: Artificial Intelligence, Large Language Models, GastroGPT, Clinical Decision Support, Gastroenterology
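The consistency comparison reported in the Results (mean rating and score variance per model) can be illustrated with a minimal sketch. The rating values below are hypothetical placeholders, not the study's data; the study's actual statistical analysis is not reproduced here.

```python
import statistics

# Hypothetical Likert ratings (1-10) per model, for illustration only.
# These numbers are invented and do not come from the study.
ratings = {
    "GastroGPT": [8, 9, 7, 8, 10, 8, 7, 9],
    "LLM-A":     [5, 3, 8, 2, 6, 9, 4, 5],
}

for model, scores in ratings.items():
    mean = statistics.mean(scores)
    # Sample variance: a lower value indicates more consistent ratings,
    # the kind of consistency metric the abstract refers to.
    var = statistics.variance(scores)
    print(f"{model}: mean={mean:.2f}, variance={var:.2f}")
```

In this toy example the specialty model's ratings have both a higher mean and a lower variance, mirroring the pattern the abstract describes (higher overall scores with greater score consistency).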