An effective hot topic detection method for microblog on spark

Bibliographic Details
Published in: Applied Soft Computing, Vol. 70, pp. 1010-1023
Main Authors: Ai, Wei; Li, Kenli; Li, Keqin
Format: Journal Article
Language: English
Published: Elsevier B.V. 01.09.2018
ISSN:1568-4946, 1872-9681
Description
Summary:
• We propose an efficient method called parallel two-phase mic-mac hot topic detection (TMHTD) to detect hot topics in microblogs in a large dataset.
• We design three optimization methods for TMHTD to improve the accuracy of hot topic detection.
• We import the TMHTD method and the related algorithms into the Spark cloud computing environment.
• We evaluate the performance of the proposed solution through extensive experiments under different data set sizes and varying Spark platform configurations. Extensive experimental results indicate that the accuracy and performance of the TMHTD algorithm can be improved significantly.

With the emergence of the big data age, methods for quickly and accurately obtaining valuable hot topics from the vast amount of digitized textual material have attracted much attention. In this work, we focus on topic detection in microblogs in the big data environment. Unlike existing approaches, we solve this problem in a distributed way. Specifically, we propose a non-iterative algorithm called parallel two-phase mic-mac hot topic detection (TMHTD) and implement it in the Apache Spark environment. The proposed TMHTD method has two phases: the micro-clustering phase and the macro-clustering phase. To improve the accuracy of hot topic detection, we propose three optimization methods alongside TMHTD. To handle large databases, we design a group of MapReduce jobs that accomplish hot topic detection in a highly scalable way. We compare the TMHTD algorithm with the general single-pass algorithm and the Latent Dirichlet Allocation (LDA) algorithm. Our experiments are carried out on real-life data sets gathered from the Sina Weibo API. Extensive experimental results indicate that the accuracy and performance of the TMHTD algorithm are significantly better than those of previous methods. More specifically, the F-measure of the TMHTD algorithm is 6% higher than that of the general single-pass algorithm and 8% higher than that of the LDA algorithm. The TMHTD algorithm also runs 7 times as fast as the general single-pass algorithm and twice as fast as the LDA algorithm.
DOI:10.1016/j.asoc.2017.08.053
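
The summary names the general single-pass algorithm as one of the baselines for comparison. As a rough illustration only, the following Python sketch shows the textbook form of single-pass clustering: each post is compared with the existing clusters and either joins the most similar one or opens a new cluster. It is not the authors' TMHTD method and not their baseline implementation. The whitespace tokenizer, the cosine similarity on raw term counts, the threshold value of 0.3, and the function names are all assumptions made for this toy example.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(posts, threshold=0.3):
    """Cluster posts in one pass over the input.

    `threshold` is a hypothetical value chosen for this example,
    not a setting taken from the paper.
    """
    centroids = []  # summed term counts, one Counter per cluster
    members = []    # indices of the posts in each cluster
    for i, text in enumerate(posts):
        vec = Counter(text.lower().split())
        best, best_sim = None, 0.0
        for k, centroid in enumerate(centroids):
            sim = cosine(vec, centroid)
            if sim > best_sim:
                best, best_sim = k, sim
        if best is not None and best_sim >= threshold:
            # Similar enough: merge the post into the closest cluster.
            centroids[best].update(vec)
            members[best].append(i)
        else:
            # No cluster is close enough: start a new one.
            centroids.append(vec)
            members.append([i])
    return members


posts = [
    "earthquake hits city center",
    "city center earthquake damage",
    "new phone launch event",
    "phone launch event today",
]
clusters = single_pass(posts)
print(clusters)
```

In a sketch like this, the largest clusters would be the candidate hot topics. Because each post is compared against clusters built from all earlier posts, the loop is inherently sequential, which is one reason a distributed design on Spark needs a different structure.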