Characteristics of Understanding URLs and Domain Names Features: The Detection of Phishing Websites with Machine Learning Methods


Creative Commons License

Kara I., Ok M., Ozaday A.

IEEE Access, vol.10, pp.124420-124428, 2022 (SCI-Expanded)

  • Publication Type: Article
  • Volume: 10
  • Publication Date: 2022
  • Doi Number: 10.1109/access.2022.3223111
  • Journal Name: IEEE Access
  • Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
  • Page Numbers: pp.124420-124428
  • Keywords: Phishing, Uniform resource locators, Machine learning, Classification algorithms, Machine learning algorithms, Support vector machines, Feature extraction, Computer security, Cyber-security, website features
  • Hacettepe University Affiliated: Yes

Abstract

© 2013 IEEE. The growth of communication channels has also given rise to more harmful and more challenging websites across information systems and electronic devices. According to current estimates, compiling detailed information on attackers requires a very large budget. Furthermore, methods in the literature that rely only on HTML, DOM, and URL based features are easily manipulated by attackers. To respond to these attacks, we propose a new method that detects phishing websites by classifying the URLs and domain names of websites with six different classification algorithms according to eleven predetermined features. For this method, we created a previously unused list. The list was obtained by analyzing an index built from information provided by internationally reputable intelligence services and organizations. The proposed method simplifies feature extraction and reduces processing overhead, while going beyond the analysis of HTML, DOM, and URL based features by considering URLs and domain names. Because it gave the highest accuracy among the six classification results, we preferred the Random Forest algorithm. In this study, we use a dataset of 32,928 records, of which 12,134 are non-phishing websites and 20,614 are phishing websites, labeled according to eleven predetermined features. Our experimental results show that phishing websites can be detected with as much as 98.90% accuracy with our proposed method. As a result, it has been demonstrated that RF descriptors with SVM representation can be utilized to accurately mark phishing web pages. In addition, feature updates can be tracked through a continuously updated source.
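
To make the idea of URL and domain name features concrete, the following is a minimal sketch of lexical feature extraction in Python. The abstract does not list the eleven features the authors used, so every feature name below, and the function name extract_features, is a hypothetical illustration and not the authors' method.

```python
from urllib.parse import urlsplit
import ipaddress


def extract_features(url: str) -> dict:
    """Return a few lexical URL/domain features (hypothetical, for illustration)."""
    # Add a scheme when it is missing so that urlsplit can locate the host.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # A raw IP address in place of a domain name is a commonly cited warning sign.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits_in_host": sum(ch.isdigit() for ch in host),
        "host_is_ip": int(host_is_ip),
        "has_at_symbol": int("@" in url),
        "uses_https": int(parts.scheme == "https"),
    }


if __name__ == "__main__":
    print(extract_features("http://192.168.0.1/login"))
```

Rows of such numeric features, paired with phishing or non-phishing labels, could then be passed to a classifier such as scikit-learn's RandomForestClassifier. That pairing is an assumption made for illustration; the abstract does not say which tooling the authors used.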