Effectively Detecting Content Spam on the Web Using Topical Diversity Measures

Bibliographic Details
Published in: 2012 IEEE/WIC/ACM International Conferences on Web Intelligence and Intelligent Agent Technology, Vol. 1, pp. 266-273
Main Authors: Dong, Cailing; Zhou, Bin
Format: Conference Proceeding
Language: English
Published: IEEE 01.12.2012
ISBN: 9781467360579, 1467360570
Description
Summary: Recent studies on web spam detection have used various content-based and link-based features to construct spam classification models. In this paper, we conduct a thorough analysis of content spam on the web using topic models and propose several novel topical diversity measures for content spam detection. We adopt the web spam benchmark data set WEBSPAM-UK2007 for evaluation, and the experimental results verify that integrating our topical diversity measures greatly improves the performance of state-of-the-art web spam detection methods. In addition, compared with existing features for training spam classification models, our topical diversity measures achieve high spam detection performance using only a small set of training data. In personalized web spam detection, the training data (i.e., a user's spam labeling results) are typically limited, so this finding makes personalized web spam detection highly achievable. We develop an efficient and effective regression model using topical diversity measures for personalized web spam detection, and present some promising results obtained from an empirical study.
DOI: 10.1109/WI-IAT.2012.98