ENDOSCOPY, vol.57, no.S 02, pp.45, 2025 (SCI-Expanded)
Aims This study aimed to evaluate the GPT-2 language model's capacity to process medical statements in endoscopy and gastroenterology, and to assess attention between word pairs across different clinical categories.

Methods The GPT-2 model processed 100 statements (50 true, 50 false) covering endoscopic findings, diagnoses, and treatments. Words in the statements were assigned to six categories: endoscopic findings, diagnoses, treatments, laboratory findings, clinical symptoms, and general medical terms. Attention scores between words in each statement, ranging from 0 to 1, were analyzed for all pairwise combinations of the six categories to provide insight into the model's ability to understand different clinical relationships. The correlation between statement length and mean attention score was assessed with Pearson's correlation coefficient, and attention scores were compared between correctly and incorrectly classified statements. Mean attention scores for each word-pair category were calculated separately for true and false statements to identify patterns in the model's attention distribution. Statistical analyses included Pearson's correlation coefficient for length-score relationships, two-tailed t-tests for comparing attention scores, 95% confidence intervals for all measurements, and a significance threshold of p<0.05.

Results The GPT-2 model correctly classified 42 of 50 true statements (84%) and 38 of 50 false statements (76%). In true statements, the highest attention scores were observed for endoscopic finding-diagnosis pairs (0.89, e.g., ulcer-gastritis), diagnosis-treatment pairs (0.85, e.g., esophagitis-PPI), and endoscopic finding-treatment pairs (0.82, e.g., bleeding-sclerotherapy). In false statements, these scores were 0.81, 0.79, and 0.76, respectively.
The model misclassified 8 true statements, showing notably low attention scores for endoscopic finding-treatment (0.61, e.g., varices-band ligation) and diagnosis-endoscopic finding (0.58, e.g., ulcer-bleeding) pairs. Twelve false statements were misclassified despite high attention scores (endoscopic finding-diagnosis: 0.77, e.g., polyp-gastritis). Laboratory-related word pairs showed moderate attention scores (diagnosis-laboratory: 0.78, e.g., ulcer-hemoglobin). The lowest attention scores occurred for non-specific word matches (0.35 in true statements, 0.28 in false statements). A negative correlation was found between statement length and attention score (r=-0.38, p<0.001): statements of 6-8 words averaged higher attention scores (0.72) than statements of 9-12 words (0.64).

Conclusions This study demonstrates that the GPT-2 model attends strongly to word pairs reflecting endoscopic finding-diagnosis and diagnosis-treatment relationships when processing gastroenterology and endoscopy statements. The model exhibits higher attention scores for true statements and more consistent attention patterns in shorter statements.
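The category-pair attention aggregation and length-score correlation described in Methods can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: it assumes per-statement attention matrices have already been extracted (e.g., via a transformer library's option to return attention weights) and averaged over layers and heads, and the function name, toy tokens, category labels, and all numeric values are invented for the example.

```python
import numpy as np

def mean_pair_attention(attn, categories, cat_a, cat_b):
    """Mean bidirectional attention between tokens of two word categories.

    attn: (n_tokens, n_tokens) attention matrix, already averaged over
    layers and heads. categories: one category label per token.
    """
    idx_a = [i for i, c in enumerate(categories) if c == cat_a]
    idx_b = [i for i, c in enumerate(categories) if c == cat_b]
    # Average the two directions (i attends to j, j attends to i).
    scores = [(attn[i, j] + attn[j, i]) / 2.0
              for i in idx_a for j in idx_b if i != j]
    return float(np.mean(scores)) if scores else float("nan")

# Toy 4-token statement with hand-made attention weights (illustrative only).
attn = np.array([
    [0.10, 0.20, 0.60, 0.10],   # token 0: "ulcer"     (endoscopic finding)
    [0.30, 0.10, 0.40, 0.20],   # token 1: "causes"    (general term)
    [0.70, 0.10, 0.10, 0.10],   # token 2: "gastritis" (diagnosis)
    [0.20, 0.20, 0.50, 0.10],   # token 3: "PPI"       (treatment)
])
cats = ["finding", "general", "diagnosis", "treatment"]
finding_dx = mean_pair_attention(attn, cats, "finding", "diagnosis")
print(f"finding-diagnosis attention: {finding_dx:.2f}")  # 0.65

# Length vs. mean attention score: Pearson correlation on illustrative data
# (scipy.stats.pearsonr would additionally give a p-value).
lengths = [6, 7, 8, 9, 10, 11, 12]
mean_scores = [0.74, 0.72, 0.70, 0.67, 0.66, 0.64, 0.62]
r = np.corrcoef(lengths, mean_scores)[0, 1]
print(f"Pearson r = {r:.2f}")
```

Averaging the two attention directions is one reasonable choice for scoring a word pair; a directional analysis (finding attending to diagnosis versus the reverse) would keep the two entries separate instead.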