TEXT: Automatic Template Extraction from Heterogeneous Web Pages

Bibliographic Details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 23, no. 4, pp. 612-626
Main Authors: Kim, Chulyun; Shim, Kyuseok
Format: Journal Article
Language: English
Published: New York, NY IEEE 01.04.2011
IEEE Computer Society
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1041-4347, 1558-2191
Abstract: World Wide Web is the most useful source of information. In order to achieve high productivity of publishing, the webpages in many websites are automatically populated by using the common templates with contents. The templates provide readers easy access to the contents guided by consistent structures. However, for machines, the templates are considered harmful since they degrade the accuracy and performance of web applications due to irrelevant terms in templates. Thus, template detection techniques have received a lot of attention recently to improve the performance of search engines, clustering, and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents which are generated from heterogeneous templates. We cluster the web documents based on the similarity of underlying template structures in the documents so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure with its fast approximation for clustering and provide comprehensive analysis of our algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared to the state of the art for template detection algorithms.
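Illustrative note: the abstract mentions clustering documents by the similarity of their template structures and a fast approximation, and this record's keywords list MinHash. The following is only a minimal sketch of generic MinHash-based Jaccard estimation, not the authors' algorithm or goodness measure. It assumes, purely for illustration, that each page is represented as a set of DOM path strings; the function names and the example paths are hypothetical.

```python
import hashlib


def _hash(seed: int, item: str) -> int:
    """Seeded 64-bit hash of a string (stands in for one random permutation)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(items, num_hashes=128):
    """Return the MinHash signature of a non-empty collection of strings."""
    items = set(items)
    if not items:
        raise ValueError("cannot build a signature for an empty set")
    # For each seeded hash function, keep the smallest hash value in the set.
    return [min(_hash(seed, it) for it in items) for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity, for comparison with the estimate."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)


# Hypothetical pages, each described by the set of DOM paths it contains.
page_a = {"html/body/div/h1", "html/body/div/ul/li", "html/body/div/p",
          "html/body/table/tr/td"}
page_b = {"html/body/div/h1", "html/body/div/ul/li", "html/body/div/p",
          "html/body/form/input", "html/body/span"}

sig_a = minhash_signature(page_a)
sig_b = minhash_signature(page_b)
print("exact:", exact_jaccard(page_a, page_b))
print("estimate:", estimate_jaccard(sig_a, sig_b))
```

Signatures have a fixed length regardless of page size, so comparing two pages costs the same no matter how many paths they contain; that is the usual reason to prefer such an estimate over exact set comparison when many documents must be clustered.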
Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), Apr 2011; 2015 INIST-CNRS
DOI: 10.1109/TKDE.2010.140
Peer Reviewed: Yes
Keywords: Electronic document; Information extraction; Information source; Classification; World wide web; Robustness; Document analysis; Pattern extraction; Productivity; Content access; Template extraction; Useful information; Search engine; Cluster; Information retrieval; Text; Case based reasoning; Minimum description length principle; Document structure; Minimum principle; Clustering; Internet; Web site; MinHash; Algorithm analysis
License: https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html (CC BY 4.0)
Subjects: Algorithms; Applied sciences; Artificial intelligence; Clustering; Clustering algorithms; Clusters; Computer science, control theory, systems; Computer systems and distributed systems, User interface; Data mining; Data models; Data processing, List processing, Character string processing; Exact sciences and technology; HTML; Information systems, Data bases; Memory organisation, Data processing; Merging; MinHash; Minimum description length principle; Readers; Search engines; Similarity; Software; Speech and sound recognition and synthesis, Linguistics; Studies; Template extraction; Web pages; Websites; World Wide Web; XML
URI: https://ieeexplore.ieee.org/document/5557874