A new feature selection algorithm based on binomial hypothesis testing for spam filtering
| Published in: | Knowledge-Based Systems, Vol. 24, No. 6, pp. 904-914 |
|---|---|
| Main authors: | , , , , |
| Format: | Journal Article |
| Language: | English, Japanese |
| Published: | Elsevier B.V., 1 August 2011 |
| Subjects: | |
| ISSN: | 0950-7051, 1872-7409 |
| Summary: | Content-based spam filtering is a binary text categorization problem. To improve the performance of the spam filtering, feature selection, as an important and indispensable means of text categorization, also plays an important role in spam filtering. We proposed a new method, named Bi-Test, which utilizes binomial hypothesis testing to estimate whether the probability of a feature belonging to the spam satisfies a given threshold or not. We have evaluated Bi-Test on six benchmark spam corpora (pu1, pu2, pu3, pua, lingspam and CSDMC2010), using two classification algorithms, Naïve Bayes (NB) and Support Vector Machines (SVM), and compared it with four famous feature selection algorithms (information gain, χ²-statistic, improved Gini index and Poisson distribution). The experiments show that Bi-Test performs significantly better than χ²-statistic and Poisson distribution, and produces comparable performance with information gain and improved Gini index in terms of F1 measure when Naïve Bayes classifier is used; it achieves comparable performance with the other methods when SVM classifier is used. Moreover, Bi-Test executes faster than the other four algorithms. |
|---|---|
| DOI: | 10.1016/j.knosys.2011.04.006 |
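
The summary describes Bi-Test only at a high level: a binomial hypothesis test on whether a feature's probability of belonging to spam meets a threshold. The sketch below illustrates that general idea and is not the paper's exact formulation. It assumes a one-sided exact binomial test per term, and the names `binomial_upper_tail`, `select_features`, the threshold `p0`, the significance level `alpha`, and the document counts are all hypothetical choices made for illustration.

```python
from math import comb


def binomial_upper_tail(k: int, n: int, p0: float) -> float:
    """Return P(X >= k) for X ~ Binomial(n, p0) (exact tail sum)."""
    return sum(comb(n, i) * p0**i * (1 - p0) ** (n - i) for i in range(k, n + 1))


def select_features(doc_counts, p0=0.5, alpha=0.05):
    """Keep terms whose spam rate is significantly above the threshold p0.

    doc_counts maps term -> (spam_docs_with_term, total_docs_with_term).
    Assumption: H0 is "P(spam | term) = p0" against H1 "P(spam | term) > p0";
    the abstract does not spell out the hypotheses, so this is a guess.
    """
    selected = []
    for term, (k, n) in doc_counts.items():
        if n == 0:
            continue  # term never observed, nothing to test
        p_value = binomial_upper_tail(k, n, p0)
        if p_value <= alpha:
            selected.append((term, p_value))
    # Smallest p-value first: strongest evidence of being spam-indicative.
    return sorted(selected, key=lambda pair: pair[1])


# Hypothetical counts, invented purely for illustration.
example_counts = {"viagra": (18, 20), "meeting": (3, 20), "free": (12, 20)}

if __name__ == "__main__":
    for term, p_value in select_features(example_counts):
        print(term, p_value)
```

With these made-up counts, only a term seen almost exclusively in spam passes the test; a term that appears in spam 12 times out of 20 does not, because that split is plausible under `p0 = 0.5`. A symmetric lower-tail test could be added to pick up terms that indicate legitimate mail.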