Meta-Entity Driven Triplet Mining for Aligning Medical Vision-Language Models


Yilmaz M. B., ÖZTÜRK Ş., Kara M., Yavuz M. T., Gumeler E., Koc A., ...

IEEE Journal of Biomedical and Health Informatics, 2026 (SCI-Expanded, Scopus)

  • Publication Type: Article / Full Article
  • Publication Date: 2026
  • DOI: 10.1109/jbhi.2026.3678306
  • Journal Name: IEEE Journal of Biomedical and Health Informatics
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, BIOSIS, Compendex, EMBASE, INSPEC, MEDLINE
  • Keywords: multimodal alignment, radiography, triplet mining, vision-language model
  • Hacettepe University Affiliated: Yes

Abstract

Imaging-based diagnostics rely on evaluation of both medical images and radiology reports, but increasing data volumes strain medical experts, leading to errors and workflow delays. Medical vision-language models (med-VLMs) offer an efficient approach for processing multimodal imaging data, especially for chest X-rays (CXRs), though their success depends on effective image-text alignment. Existing alignment methods for med-VLMs, primarily based on contrastive learning, often focus on coarse-grained disease class separation and overlook fine-grained pathology attributes such as location, size, or severity, which results in suboptimal representations. We introduce MedTrim (Meta-entity-driven Triplet mining), a novel alignment method that improves precision via structured triplet learning guided by meta-entities extracted from radiology reports. Unlike conventional contrastive and triplet frameworks for representational learning that rely on global class labels or implicit similarity references, MedTrim explicitly models hierarchical relationships between pathology attributes to preserve clinically meaningful intra-class variation. To do this, MedTrim leverages a domain-specific ontology to identify adjectival qualifiers and directional descriptors of pathology, a novel entity-aware triplet mining score to capture hierarchical inter-sample similarity, and a multimodal alignment objective that enforces consistency across image-text pairs sharing detailed pathology attributes without compromising within-modality relationships. MedTrim improves performance in downstream retrieval, classification, and generation tasks compared to leading alignment methods.
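
To make the idea of entity-aware triplet mining more concrete, the following is a minimal illustrative sketch and not the MedTrim implementation. The abstract does not give the scoring formula, so everything here is an assumption: the `Report` structure, the Jaccard-based `entity_score`, the weights `w_path` and `w_attr`, the `min_gap` threshold, and the plain hinge-style `triplet_loss` are all hypothetical stand-ins chosen only to show how a score that ranks attribute agreement beneath pathology agreement could drive the choice of positives and negatives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    """Hypothetical meta-entity summary of one radiology report."""
    pathologies: frozenset  # e.g. {"effusion"}
    attributes: frozenset   # e.g. {"left", "small"}


def jaccard(a, b):
    """Jaccard overlap of two sets; defined as 1.0 when both are empty."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def entity_score(x, y, w_path=0.7, w_attr=0.3):
    """Assumed hierarchical similarity: attribute overlap only counts
    when the two reports already overlap at the pathology level."""
    p = jaccard(x.pathologies, y.pathologies)
    a = jaccard(x.attributes, y.attributes) if p > 0 else 0.0
    return w_path * p + w_attr * a


def mine_triplets(reports, min_gap=0.2):
    """For each anchor, pick the highest-scoring sample as positive and the
    lowest-scoring sample as negative, if their scores differ by min_gap."""
    triplets = []
    for i, anchor in enumerate(reports):
        scored = [(entity_score(anchor, r), j)
                  for j, r in enumerate(reports) if j != i]
        if len(scored) < 2:
            continue
        pos_score, pos = max(scored)
        neg_score, neg = min(scored)
        if pos_score - neg_score >= min_gap:
            triplets.append((i, pos, neg))
    return triplets


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Standard hinge triplet loss on Euclidean distances between embeddings."""
    d_pos = sum((u - v) ** 2 for u, v in zip(anchor, positive)) ** 0.5
    d_neg = sum((u - v) ** 2 for u, v in zip(anchor, negative)) ** 0.5
    return max(0.0, d_pos - d_neg + margin)


# Toy reports: two matching effusions, one effusion with different
# attributes, and one report describing a different pathology.
reports = [
    Report(frozenset({"effusion"}), frozenset({"left", "small"})),
    Report(frozenset({"effusion"}), frozenset({"left", "small"})),
    Report(frozenset({"effusion"}), frozenset({"right", "large"})),
    Report(frozenset({"pneumothorax"}), frozenset({"right"})),
]
triplets = mine_triplets(reports)
```

In this toy setup, two reports with the same pathology and the same attributes score higher than two reports that share only the pathology, which in turn score higher than reports with different pathologies. That ordering is the kind of intra-class distinction the abstract argues is lost when mining relies on global class labels alone. In practice the embeddings passed to `triplet_loss` would come from the image and text encoders, and the loss would be applied across modalities as well as within them.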