Characteristics of Understanding URLs and Domain Names Features: The Detection of Phishing Websites with Machine Learning Methods

Kara, Ilker; Ok, Murathan; Ozaday, Ahmet

doi:10.1109/access.2022.3223111

Kara I., Ok M., Ozaday A.

IEEE Access, vol. 10, pp. 124420-124428, 2022 (SCI-Expanded)

Publication Type: Article / Full Article
Volume: 10
Publication Date: 2022
DOI: 10.1109/access.2022.3223111
Journal Name: IEEE Access
Journal Indexes: Science Citation Index Expanded (SCI-EXPANDED), Scopus, Compendex, INSPEC, Directory of Open Access Journals
Pages: pp. 124420-124428
Keywords: Phishing, Uniform resource locators, Machine learning, Classification algorithms, Machine learning algorithms, Support vector machines, Feature extraction, Computer security, Cyber-security, Website features
Hacettepe University Affiliated: Yes

Abstract

© 2013 IEEE. As the means of communication have expanded, more harmful and more challenging websites have emerged across information systems and electronic devices. According to current estimates, gathering detailed information on attackers requires a very large budget. Furthermore, approaches in the literature that rely only on HTML, DOM, and URL based features are easily manipulated by attackers. To respond to these attacks, we propose a new method that detects phishing websites by classifying the URLs and domain names of websites with six different classifier algorithms according to eleven predetermined features. For this method, we created a previously unused list, obtained by analyzing an index compiled from information provided by internationally reputable intelligence services and organizations. The proposed method simplifies feature extraction and reduces processing overhead, while going beyond the analysis of HTML, DOM, and URL based features by considering both URLs and domain names. Because it achieved the highest accuracy among the six classifiers, we preferred the Random Forest algorithm. In this study, we use a dataset of 32,928 records, of which 12,134 are non-phishing websites and 20,614 are phishing websites, labeled according to the eleven predetermined features. Our experimental results show that the proposed method detects phishing websites with up to 98.90% accuracy. These results demonstrate that RF descriptors with SVM representation can be used to accurately mark phishing web pages. In addition, changes in phishing characteristics can be tracked through a continuously updated source.
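
The abstract does not enumerate the eleven features, so the following is only a minimal sketch of what extracting lexical features from a URL and its domain name might look like. Every feature name below, and the helper names extract_features and to_vector, are hypothetical illustrations rather than the authors' feature set. The resulting numeric vectors are the kind of input a classifier such as Random Forest could be trained on.

```python
from urllib.parse import urlsplit
import ipaddress


def extract_features(url: str) -> dict:
    """Return a few illustrative lexical features of a URL and its host name.

    Assumption: these features are hypothetical examples, not the eleven
    features used in the paper, which the abstract does not list.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""

    # Check whether the host is a literal IP address rather than a domain name.
    try:
        ipaddress.ip_address(host)
        host_is_ip = True
    except ValueError:
        host_is_ip = False

    return {
        "url_length": len(url),
        "host_length": len(host),
        "num_dots_in_host": host.count("."),
        "num_hyphens_in_host": host.count("-"),
        "num_digits_in_host": sum(ch.isdigit() for ch in host),
        "has_at_symbol": "@" in url,
        "uses_https": parts.scheme == "https",
        "host_is_ip": host_is_ip,
    }


def to_vector(features: dict) -> list:
    """Flatten the feature dict into a list of ints in a fixed (sorted) key order."""
    return [int(features[key]) for key in sorted(features)]
```

Because the features are computed from the URL string alone, no page content has to be fetched or parsed, which is consistent with the reduced processing overhead the abstract describes.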