Assessing the informatics utility of artificial intelligence chatbots for patient and clinician support: the case of adrenal incidentaloma


Basmaci N., Aksun S., Yildiz O. B.

Endocrine, vol. 91, no. 1, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article
  • Volume: 91 Issue: 1
  • Publication Date: 2026
  • DOI: 10.1007/s12020-025-04491-6
  • Journal Name: Endocrine
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Chemical Abstracts Core, EMBASE, MEDLINE
  • Keywords: Adrenal incidentaloma, AI chatbots, Artificial intelligence (AI), Clinical decision support, Deep learning in medicine, Natural language processing (NLP)
  • Hacettepe University Affiliated: Yes

Abstract

Objective: The use of artificial intelligence (AI) in clinical medicine is expanding rapidly. Over the past two decades, the incidence of adrenal tumors has increased tenfold, primarily owing to the widespread use of imaging and the resulting surge in the detection of adrenal incidentalomas. This comparative study assessed the performance of leading AI chatbots in answering patient- and physician-oriented questions on adrenal incidentalomas, focusing on accuracy, quality, hallucination tendency, and readability.

Methods: Four AI chatbots, ChatGPT 4.0, Gemini 2.0 Flash, DeepSeek, and Perplexity, were evaluated. A total of 35 questions, categorized as “patient education” or “physician support,” were submitted to each chatbot. Responses were independently assessed for accuracy, quality, and hallucination tendency using a Likert scale. Readability was analyzed using standard metrics.

Results: DeepSeek demonstrated the highest accuracy and response quality, outperforming the other chatbots (p < 0.001). ChatGPT 4.0 performed better than both Gemini 2.0 Flash and Perplexity (p < 0.001), whereas no significant difference was observed between Gemini 2.0 Flash and Perplexity. Perplexity exhibited a higher tendency to hallucinate than the other models (p < 0.001). Perplexity was also the least readable model, whereas ChatGPT 4.0 achieved the highest readability scores on both the Flesch Reading Ease (FRE) and Flesch-Kincaid Grade Level (FKGL) metrics.

Conclusion: AI chatbots can serve as valuable tools for disseminating information about adrenal incidentalomas, particularly in the context of patient education. Among the evaluated models, DeepSeek tended to provide comparatively higher accuracy and quality, with a lower tendency to hallucinate. In contrast, Perplexity demonstrated comparatively lower performance across multiple metrics.
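For readers unfamiliar with the two readability metrics cited in the Results, the following Python sketch computes FRE and FKGL from their standard published formulas. It is illustrative only and is not the tooling used in the study: the vowel-group syllable counter is a rough heuristic (validated readability tools use dictionary-based counters), so exact scores will differ from those tools.

```python
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: one syllable per run of consecutive vowels.
    # Real readability tools use more careful, dictionary-based counting.
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))

def flesch_metrics(text: str) -> tuple[float, float]:
    """Return (FRE, FKGL) for a block of English text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / len(sentences)   # average words per sentence
    spw = syllables / len(words)        # average syllables per word
    # Standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas:
    fre = 206.835 - 1.015 * wps - 84.6 * spw
    fkgl = 0.39 * wps + 11.8 * spw - 15.59
    return round(fre, 1), round(fkgl, 1)

# Short, simple sentences score high on FRE (easier) and low on FKGL.
fre, fkgl = flesch_metrics("The cat sat on the mat. It was happy.")
```

Higher FRE indicates easier text, while FKGL maps difficulty onto a US school-grade level, which is why the two metrics together give a complementary picture of chatbot readability.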