Comparative evaluation of ChatGPT-4, ChatGPT-3.5 and Google Gemini on PCOS assessment and management based on recommendations from the 2023 guideline


GÜNEŞLİ I., Aksun S., Fathelbab J., YILDIZ O. B.

ENDOCRINE, no.1, pp.315-322, 2025 (SCI-Expanded)

  • Publication Type: Article
  • Publication Date: 2025
  • DOI: 10.1007/s12020-024-04121-7
  • Journal Name: ENDOCRINE
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, CAB Abstracts, Chemical Abstracts Core, EMBASE, MEDLINE, Veterinary Science Database
  • Page Numbers: pp.315-322
  • Hacettepe University Affiliated: Yes

Abstract

Context

Artificial intelligence (AI) is increasingly utilized in healthcare, with models such as ChatGPT and Google Gemini gaining global popularity. Polycystic ovary syndrome (PCOS) is a prevalent condition that requires both lifestyle modifications and medical treatment, highlighting the critical need for effective patient education. This study compares the responses of ChatGPT-4, ChatGPT-3.5, and Gemini to PCOS-related questions based on the latest guideline. Evaluating the integration of AI into patient education requires assessing response quality, reliability, readability, and effectiveness in managing PCOS.

Purpose

To evaluate the accuracy, quality, readability, and tendency to hallucinate of responses from ChatGPT-4, ChatGPT-3.5, and Gemini to questions about PCOS and its assessment and management, based on recommendations from the current international PCOS guideline.

Methods

In this cross-sectional study, endocrinologists created PCOS-related questions from the latest guideline and common patient queries, and the responses of ChatGPT-4, ChatGPT-3.5, and Gemini were assessed. Experts rated the responses for accuracy, quality, and tendency to hallucinate on Likert scales, while readability was analyzed using standard formulas.

Results

ChatGPT-4 and ChatGPT-3.5 attained higher accuracy and quality scores than Gemini (p = 0.001, p < 0.001 and p = 0.007, p < 0.001, respectively). However, Gemini achieved a higher readability score than the other chatbots (p < 0.001). Tendency-to-hallucinate scores also differed significantly, driven by Gemini's lower scores (p = 0.003).

Conclusion

The high accuracy and quality of the responses provided by ChatGPT-4 and ChatGPT-3.5 to questions about PCOS suggest that these models could be supportive in clinical practice. Future technological advancements may facilitate the use of artificial intelligence both in educating patients with PCOS and in supporting the management of the disorder.