Malicious Text Identification: Deep Learning from Public Comments and Emails

Detailed bibliography
Title: Malicious Text Identification: Deep Learning from Public Comments and Emails
Authors: Asma Baccouche, Sadaf Ahmed, Daniel Sierra-Sosa, Adel Elmaghraby
Source: Information, Vol 11, Iss 6, p 312 (2020)
Publisher information: MDPI AG
Publication year: 2020
Collection: Directory of Open Access Journals: DOAJ Articles
Subjects: spam text filter, text mining, content-based classification, natural language processing, multi-label classification, LSTM, Information technology, T58.5-58.64
Description: Identifying internet spam has been a challenging problem for decades. Several solutions have succeeded in detecting spam comments in social media or fraudulent emails. However, an adequate strategy for filtering messages is difficult to achieve, as these messages resemble real communications. From the Natural Language Processing (NLP) perspective, Deep Learning models are a good alternative for classifying text once it has been preprocessed. In particular, Long Short-Term Memory (LSTM) networks are among the models that perform well on binary and multi-label text classification problems. This paper presents an approach that merges two different data sources, one intended for Spam classification in social media posts and the other for Fraud classification in emails. We designed a multi-label LSTM model and trained it on the joint dataset, which includes text with common bigrams extracted from each independent dataset. The experimental results show that our proposed model is capable of identifying malicious text regardless of the source. The LSTM model trained on the merged dataset outperforms the models trained independently on each dataset.
Document type: article in journal/newspaper
Language: English
Relation: https://www.mdpi.com/2078-2489/11/6/312; https://doaj.org/toc/2078-2489; https://doaj.org/article/3bb5b7176d574560be4aface49bc8aa2
DOI: 10.3390/info11060312
Availability: https://doi.org/10.3390/info11060312
https://doaj.org/article/3bb5b7176d574560be4aface49bc8aa2
Accession number: edsbas.A4F9FAF9
Database: BASE
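
The abstract mentions training on a joint dataset that includes text with common bigrams extracted from each independent dataset. This record does not describe how that selection was done, so the following is only a minimal sketch of one possible reading: collect the bigrams of each source, intersect them, and keep the texts that contain at least one shared bigram. The whitespace tokenizer, the `min_count` threshold, the function names, and the toy sentences are all assumptions made for illustration, not details taken from the paper.

```python
from collections import Counter


def bigrams(text):
    """Return the list of adjacent word pairs in a lowercased, whitespace-split text."""
    tokens = text.lower().split()
    return list(zip(tokens, tokens[1:]))


def frequent_bigrams(texts, min_count=1):
    """Return the set of bigrams that occur at least min_count times across texts."""
    counts = Counter(bg for t in texts for bg in bigrams(t))
    return {bg for bg, c in counts.items() if c >= min_count}


def keep_texts_with(texts, shared):
    """Keep only the texts that contain at least one bigram from the shared set."""
    return [t for t in texts if shared.intersection(bigrams(t))]


# Hypothetical toy data standing in for the two sources (comments and emails).
comments = [
    "click here to win a free phone",
    "great video thanks for sharing",
]
emails = [
    "please click here to claim your funds",
    "meeting moved to monday",
]

# Bigrams that appear in both sources.
shared = frequent_bigrams(comments) & frequent_bigrams(emails)

# Joint dataset restricted to texts containing a shared bigram.
merged = keep_texts_with(comments + emails, shared)
```

On this toy input, `shared` contains the pairs ("click", "here") and ("here", "to"), and `merged` keeps the first comment and the first email. A real pipeline would likely use a proper tokenizer and a higher frequency threshold before feeding the texts to a classifier.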