GramBeddings: A New Neural Network for URL Based Identification of Phishing Web Pages Through N-gram Embeddings


BOZKIR A. S., Dalgic F. C., AYDOS M.

Computers and Security, vol.124, 2023 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 124
  • Publication Date: 2023
  • Doi Number: 10.1016/j.cose.2022.102964
  • Journal Name: Computers and Security
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, PASCAL, ABI/INFORM, Aerospace Database, Applied Science & Technology Source, Business Source Elite, Business Source Premier, Communication Abstracts, Computer & Applied Sciences, Criminal Justice Abstracts, INSPEC, Metadex, Civil Engineering Abstracts
  • Hacettepe University Affiliated: Yes

Abstract

© 2022 Elsevier Ltd

The ever-growing use of the Internet and the spread of communication channels such as social media escalate the need for rapid phishing detection mechanisms with low resource demands. In this study, we propose a new deep neural model for phishing URL identification, called GramBeddings, which introduces several novelties: (1) using n-gram embeddings that are computed on the fly and require no pre-training stage, (2) removing the need for word- and sub-word-level information, and (3) providing a smart and efficient n-gram selection pipeline while benefiting from an attention mechanism. In addition, we share a publicly available, large-scale, novel dataset of 800K real-world phishing and legitimate URLs. Our scheme offers an adjustable and automated n-gram selection and filtering mechanism, along with a new neural network architecture that concatenates a four-channel information flow through cascading CNN, LSTM, and attention layers. In this way, discriminative multi-level character patterns can be discovered without any hand-crafted operation and can contribute to the prediction. As a result, the proposed system provides the following features in the problem domain: (i) real-time, end-to-end, high-performance inference, (ii) language-agnostic prediction, and (iii) no dependence on any third-party service or hand-crafted feature. Our experiments show that the approach outperforms the other models in the literature, with an accuracy of 98.27%. Moreover, a comparative study conducted on several datasets confirms the superiority of our model in all tests. We also examine the robustness of our model against a real-world adversarial attack and discuss methods for overcoming such an attack. Our codebase is shared with the community for future benchmarking.
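To make the idea of character n-gram selection concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a plain frequency-based top-k selection as a stand-in for the paper's selection and filtering mechanism, which the abstract does not specify. The function names (`char_ngrams`, `build_vocab`, `encode`) and the parameters `top_k` and `max_len` are hypothetical and chosen only for illustration.

```python
from collections import Counter
from typing import Dict, Iterable, List

PAD_ID = 0  # assumed id used to pad short sequences
UNK_ID = 1  # assumed id for n-grams that were not selected


def char_ngrams(url: str, n: int) -> List[str]:
    """Return all overlapping character n-grams of a URL."""
    return [url[i:i + n] for i in range(len(url) - n + 1)]


def build_vocab(urls: Iterable[str], n: int, top_k: int) -> Dict[str, int]:
    """Keep the top_k most frequent n-grams (an assumed selection rule)."""
    counts = Counter(g for u in urls for g in char_ngrams(u, n))
    # Ids start at 2 because 0 and 1 are reserved for padding and unknowns.
    return {gram: i + 2 for i, (gram, _) in enumerate(counts.most_common(top_k))}


def encode(url: str, n: int, vocab: Dict[str, int], max_len: int) -> List[int]:
    """Map a URL to a fixed-length id sequence for an embedding layer."""
    ids = [vocab.get(g, UNK_ID) for g in char_ngrams(url, n)][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```

The resulting id sequences could feed a trainable embedding layer, so that the n-gram vectors are learned during training rather than pre-trained. One plausible reading of the multi-channel design is to build one such vocabulary per value of n, but that mapping is an assumption here and is not stated in the abstract.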