Machine learning based phishing detection from URLs

Bibliographic Details
Published in:	Expert Systems with Applications, Vol. 117, pp. 345-357
Main Authors:	Sahingoz, Ozgur Koray; Buber, Ebubekir; Demir, Onder; Diri, Banu
Format:	Journal Article
Language:	English
Published:	New York: Elsevier Ltd / Elsevier BV, 1 March 2019
Subjects:	Accuracy; Algorithms; Artificial intelligence; Classification; Classification algorithms; Cyber attack detection; Cyber security; Cybersecurity; Internet; Machine learning; Natural language processing; Phishing; Phishing attack; Privacy; Real time; Semantics; Software; Software reliability; Websites
ISSN:	0957-4174, 1873-6793
Online Access:	Full text

Description
Summary:
• Use of 7 different classification algorithms and NLP-based features.
• A big URL data set is produced and shared (36,400 legitimate and 37,175 phishing).
• Real-time and language-independent classification algorithms.
• Feature-rich classifiers with word vectors, NLP-based and hybrid features.
• The proposed approach reaches a 97.98% accuracy rate.

Due to the rapid growth of the Internet, users have shifted their preference from traditional shopping to electronic commerce. Instead of robbing banks or shops, criminals nowadays try to find their victims in cyberspace using specific tricks. Exploiting the anonymous structure of the Internet, attackers devise new techniques, such as phishing, to deceive victims with false websites and collect sensitive information such as account IDs, usernames, and passwords. Determining whether a web page is legitimate or phishing is a very challenging problem because of its semantics-based attack structure, which mainly exploits computer users' vulnerabilities. Although software companies launch new anti-phishing products that use blacklists, heuristics, visual, and machine learning-based approaches, these products cannot prevent all phishing attacks. In this paper, a real-time anti-phishing system is proposed that uses seven different classification algorithms and natural language processing (NLP) based features. The system has the following properties that distinguish it from other studies in the literature: language independence, use of a large volume of phishing and legitimate data, real-time execution, detection of new websites, independence from third-party services, and use of feature-rich classifiers. To measure the performance of the system, a new dataset is constructed and the experiments are run on it. According to the experimental and comparative results from the implemented classification algorithms, the Random Forest algorithm with only NLP-based features gives the best performance, with a 97.98% accuracy rate for detection of phishing URLs.
DOI:	10.1016/j.eswa.2018.09.029
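
Illustration:	The summary describes classifying URLs from NLP-based features without relying on third-party services. This record does not list the paper's actual feature set, so the following is only a minimal sketch of what URL-only, word-level feature extraction might look like. The feature names, the `BRAND_KEYWORDS` list, and the helper functions are assumptions made for illustration and are not taken from the paper.

```python
import re
from urllib.parse import urlparse

# Hypothetical keyword list for illustration only; not from the paper.
BRAND_KEYWORDS = {"paypal", "apple", "google", "amazon"}


def tokenize_url(url):
    """Split a URL into lowercase word tokens on non-alphanumeric characters."""
    return [tok for tok in re.split(r"[^A-Za-z0-9]+", url.lower()) if tok]


def url_features(url):
    """Compute a few simple lexical features from the URL string alone.

    No page content, DNS lookup, or third-party service is consulted.
    """
    # urlparse only fills in the hostname when a scheme is present.
    if "://" not in url:
        url = "http://" + url
    parsed = urlparse(url)
    host = parsed.hostname or ""

    # Tokenize host, path, and query together (the scheme is excluded).
    tokens = tokenize_url(host + parsed.path + "?" + parsed.query)

    return {
        "url_length": len(url),
        "host_dots": host.count("."),
        "digit_count": sum(ch.isdigit() for ch in url),
        "num_tokens": len(tokens),
        "longest_token": max((len(tok) for tok in tokens), default=0),
        "brand_tokens": sum(tok in BRAND_KEYWORDS for tok in tokens),
        "has_at_symbol": int("@" in url),
        "ip_host": int(bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host))),
    }


if __name__ == "__main__":
    example = "http://paypal.com.secure-login.example.net/update?id=123"
    for name, value in url_features(example).items():
        print(name, value)
```

Feature dictionaries of this kind could be converted to numeric vectors and passed to any standard classifier, for example a Random Forest, which is the algorithm the summary reports as performing best.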