TEXT: Automatic Template Extraction from Heterogeneous Web Pages
| Published in: | IEEE Transactions on Knowledge and Data Engineering, Vol. 23, no. 4, pp. 612-626 |
|---|---|
| Main Authors: | Chulyun Kim, Kyuseok Shim |
| Format: | Journal Article |
| Language: | English |
| Published: | New York, NY: IEEE; IEEE Computer Society; The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2011-04-01 |
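| Subjects: | Template extraction; Clustering algorithms; Data mining; MinHash; minimum description length principle; Web pages |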
| ISSN: | 1041-4347, 1558-2191 |
| Online Access: | https://ieeexplore.ieee.org/document/5557874 |
| Abstract | The World Wide Web is the most useful source of information. To achieve high publishing productivity, the web pages on many websites are automatically populated by filling common templates with contents. The templates give readers easy access to the contents through consistent structures. For machines, however, templates are considered harmful because their irrelevant terms degrade the accuracy and performance of web applications. Template detection techniques have therefore received a lot of attention recently as a way to improve the performance of search engines and of the clustering and classification of web documents. In this paper, we present novel algorithms for extracting templates from a large number of web documents that are generated from heterogeneous templates. We cluster the web documents based on the similarity of the underlying template structures in the documents so that the template for each cluster is extracted simultaneously. We develop a novel goodness measure with a fast approximation for clustering and provide a comprehensive analysis of our algorithm. Our experimental results with real-life data sets confirm the effectiveness and robustness of our algorithm compared to state-of-the-art template detection algorithms. |
|---|---|
| Author | Chulyun Kim Kyuseok Shim |
| CODEN | ITKEEH |
| Copyright | 2015 INIST-CNRS Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) Apr 2011 |
| DOI | 10.1109/TKDE.2010.140 |
| Discipline | Engineering Computer Science Applied Sciences |
| IsPeerReviewed | true |
| IsScholarly | true |
| Keywords | Electronic document; Information extraction; Information source; Classification; World wide web; Robustness; Document analysis; Pattern extraction; Productivity; Content access; Template extraction; Useful information; Search engine; Cluster; Information retrieval; Text; Case based reasoning; minimum description length principle; Document structure; Minimum principle; clustering; Internet; Web site; MinHash; Algorithm analysis |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html CC BY 4.0 |
| PageCount | 15 |
| SubjectTerms | Algorithms Applied sciences Artificial intelligence Clustering Clustering algorithms Clusters Computer science; control theory; systems Computer systems and distributed systems. User interface Data mining Data models Data processing. List processing. Character string processing Exact sciences and technology HTML Information systems. Data bases Memory organisation. Data processing Merging MinHash minimum description length principle Readers Search engines Similarity Software Speech and sound recognition and synthesis. Linguistics Studies Template extraction Web pages Websites World Wide Web XML |
| URI | https://ieeexplore.ieee.org/document/5557874 https://www.proquest.com/docview/852926553 https://www.proquest.com/docview/864404944 |
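
The abstract describes clustering web documents by the similarity of their template structures, with a fast approximation for the clustering measure, and the keywords list MinHash. The following is a minimal illustrative sketch of how MinHash can estimate set similarity between two pages. It is not the authors' algorithm or their goodness measure. It assumes, purely for illustration, that each page is represented as a set of DOM path strings; the function names, the path strings, and the choice of BLAKE2b as the hash family are all hypothetical.

```python
import hashlib


def _seeded_hash(seed: int, item: str) -> int:
    """Hash an item under a given seed (assumed hash family: BLAKE2b, 64-bit)."""
    data = f"{seed}:{item}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_signature(paths, num_hashes=64):
    """Return a MinHash signature: the minimum hash of the set under each seed."""
    if not paths:
        raise ValueError("paths must be a non-empty set")
    return [min(_seeded_hash(seed, p) for p in paths) for seed in range(num_hashes)]


def estimated_jaccard(sig_a, sig_b):
    """Estimate Jaccard similarity as the fraction of matching signature slots."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)


def exact_jaccard(a, b):
    """Exact Jaccard similarity |a & b| / |a | b|, for comparison."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


if __name__ == "__main__":
    # Hypothetical path sets for two pages that share part of a template.
    page_a = {"html/body/div/h1", "html/body/div/ul/li", "html/body/div/p"}
    page_b = {"html/body/div/h1", "html/body/div/ul/li", "html/body/table/tr/td"}
    sig_a = minhash_signature(page_a, num_hashes=256)
    sig_b = minhash_signature(page_b, num_hashes=256)
    print("exact:", exact_jaccard(page_a, page_b))
    print("estimated:", estimated_jaccard(sig_a, sig_b))
```

Signatures have a fixed length regardless of page size, so pairwise comparisons stay cheap for large collections. The estimate becomes more accurate as `num_hashes` grows.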