Automatic Template Extraction from Heterogeneous Web Pages

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 23, No. 4, pp. 612-626
Main Authors: Kim, Chulyun; Shim, Kyuseok
Format: Journal Article
Language: English
Published: New York, NY: IEEE, 1 April 2011
IEEE Computer Society
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1041-4347, 1558-2191
Description
Summary: World Wide Web is the most useful source of information. In order to achieve high productivity of publishing, the webpages in many websites are automatically populated by using the common templates with contents. The templates provide readers easy access to the contents guided by consistent structures. However, for machines, the templates are considered harmful since they degrade the accuracy and performance of web applications due to irrelevant terms in templates. Thus, template detection techniques have received a lot of attention recently to improve the performance of search engines, clustering, and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents which are generated from heterogeneous templates. We cluster the web documents based on the similarity of underlying template structures in the documents so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure with its fast approximation for clustering and provide comprehensive analysis of our algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared to the state of the art for template detection algorithms.
DOI: 10.1109/TKDE.2010.140