Applied Sciences (Switzerland), vol.16, no.2, 2026 (SCI-Expanded, Scopus)
Phishing attacks continue to evolve, exploiting fundamental human impulses such as trust and the urge to respond quickly, as well as emotional triggers. This makes the human mind both a valuable asset and a significant vulnerability. The proliferation of zero-day vulnerabilities has been identified as a significant exacerbating factor in this threat landscape. To address these evolving challenges, we introduce CrossPhire, a multimodal deep learning framework with an end-to-end architecture that captures semantic and visual cues from multiple data modalities, while also providing methodological insights for multimodal anti-phishing learning. First, we demonstrate that markup-free semantic text encoding captures linguistic deception patterns more effectively than DOM-based approaches: sentence transformers applied to HTML text stripped of structural markup achieve 96–97% accuracy using textual content alone, providing the strongest single-modality signal. Second, through a controlled comparison of fusion strategies, we show that simple concatenation outperforms a more sophisticated gating mechanism, Mixture-of-Experts, by 0.5–10% when modalities provide complementary, non-redundant security evidence. We validate these insights through rigorous experimentation on five datasets, achieving competitive same-dataset performance (97.96–100%) while demonstrating promising cross-dataset generalization (85–96% accuracy under distribution shift). Additionally, we contribute Phish360, a rigorously curated multimodal benchmark of 10,748 samples that addresses quality issues in existing datasets (96.63% unique phishing HTML vs. 16–61% in prior benchmarks), and provide LIME-based explainability tools that decompose predictions into modality-specific contributions. The rapid inference time (0.08 s) and high accuracy position CrossPhire as a promising solution in the fight against phishing attacks.
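The two methodological points in the abstract, markup-free text extraction and concatenation versus gated fusion, can be illustrated with a minimal sketch. This is not the CrossPhire implementation: the names `TextOnly`, `strip_markup`, `concat_fusion`, and `gated_fusion` are hypothetical, the embeddings are placeholder vectors rather than sentence-transformer outputs, and the softmax-weighted sum is only a simplified stand-in for a Mixture-of-Experts gate.

```python
from html.parser import HTMLParser
import math


class TextOnly(HTMLParser):
    """Collects visible text; drops tags and script/style contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that sits outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def strip_markup(html):
    """Return the page text with structural markup removed (hypothetical helper)."""
    parser = TextOnly()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def concat_fusion(embeddings):
    """Concatenate per-modality vectors into one feature vector."""
    return [x for vec in embeddings for x in vec]


def gated_fusion(embeddings, gate_logits):
    """Softmax-weighted sum of equally sized vectors.

    Assumption: a simplified gate for illustration, not a full Mixture-of-Experts.
    """
    peak = max(gate_logits)
    exps = [math.exp(g - peak) for g in gate_logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(embeddings[0])
    return [sum(w * vec[i] for w, vec in zip(weights, embeddings)) for i in range(dim)]


page = ("<html><head><style>p{color:red}</style><script>var x=1;</script></head>"
        "<body><h1>Verify your account</h1><p>Click <a href='#'>here</a> now</p></body></html>")
text = strip_markup(page)

# Placeholder embeddings standing in for text and visual encoders.
text_vec, image_vec = [1.0, 2.0], [3.0, 4.0]
fused_concat = concat_fusion([text_vec, image_vec])
fused_gated = gated_fusion([text_vec, image_vec], [0.0, 0.0])
```

Concatenation keeps every modality's features side by side, so a downstream classifier can weigh them independently. The gated variant compresses them into a single weighted vector, which can discard evidence when the modalities are complementary rather than redundant.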