CrossPhire: Benefiting Multimodality for Robust Phishing Web Page Identification


Almakhamreh A. H. A., BOZKIR A. S.

Applied Sciences (Switzerland), vol.16, no.2, 2026 (SCI-Expanded, Scopus) identifier

  • Publication Type: Article / Article
  • Volume: 16 Issue: 2
  • Publication Date: 2026
  • Doi Number: 10.3390/app16020751
  • Journal Name: Applied Sciences (Switzerland)
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Keywords: computer vision, cybersecurity, information security, machine learning, multimodality, phishing detection
  • Hacettepe University Affiliated: Yes

Abstract

Phishing attacks continue to evolve and exploit fundamental human impulses, such as trust and the need for a rapid response, as well as emotional triggers. This makes the human mind both a valuable asset and a significant vulnerability. The proliferation of zero-day vulnerabilities has been identified as a significant exacerbating factor in this threat landscape. To address these evolving challenges, we introduce CrossPhire: a multimodal deep learning framework with an end-to-end architecture that captures semantic and visual cues from multiple data modalities, while also providing methodological insights for anti-phishing multimodal learning. First, we demonstrate that markup-free semantic text encoding captures linguistic deception patterns more effectively than DOM-based approaches, achieving 96–97% accuracy using textual content alone and providing the strongest single-modality signal through sentence transformers applied to HTML text stripped of structural markup. Second, through controlled comparison of fusion strategies, we show that simple concatenation outperforms a sophisticated gating mechanism so-called Mixture-of-Experts by 0.5–10% when modalities provide complementary, non-redundant security evidence. We validate these insights through rigorous experimentation on five datasets, achieving competitive same-dataset performance (97.96–100%) while demonstrating promising cross-dataset generalization (85–96% accuracy under distribution shift). Additionally, we contribute Phish360, a rigorously curated multimodal benchmark with 10,748 samples addressing quality issues in existing datasets (96.63% unique phishing HTML vs. 16–61% in prior benchmarks), and provide LIME-based explainability tools that decompose predictions into modality-specific contributions. The rapid inference time (0.08 s) and high accuracy results position CrossPhire as a promising solution in the fight against phishing attacks.