Discovering Newsworthy Themes from Sequenced Data: A Step Towards Computational Journalism

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 29, no. 7, pp. 1398-1411
Main Authors: Fan, Qi; Li, Yuchen; Zhang, Dongxiang; Tan, Kian-Lee
Format: Journal Article
Language: English
Published: New York IEEE 01.07.2017
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1041-4347, 1558-2191
Description
Summary: Automatic discovery of newsworthy themes from sequenced data can relieve journalists from manually poring over a large amount of data in order to find interesting news. In this paper, we propose a novel $k$-Sketch query that aims to find $k$ striking streaks to best summarize a subject. Our scoring function takes into account streak strikingness and streak coverage at the same time. We study the $k$-Sketch query processing in both offline and online scenarios, and propose various streak-level pruning techniques to find striking candidates. Among those candidates, we then develop approximate methods to discover the $k$ most representative streaks with theoretical bounds. We conduct experiments on four real datasets, and the results demonstrate the efficiency and effectiveness of our proposed algorithms: the running time achieves up to 500 times speedup and the quality of the generated summaries is endorsed by the anonymous users from Amazon Mechanical Turk.
DOI: 10.1109/TKDE.2017.2685587