xCrawl: a high-recall crawling method for Web mining
Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of t...
| Published in: | Knowledge and information systems, Vol. 25, No. 2, pp. 303-326 |
|---|---|
| Main authors: | Shchekotykhin, Kostyantyn; Jannach, Dietmar; Friedrich, Gerhard |
| Format: | Journal Article |
| Language: | English |
| Published: | London: Springer-Verlag, 1 November 2010; Springer; Springer Nature B.V |
| Subjects: | |
| ISSN: | 0219-1377, 0219-3116 |
| Online access: | Get full text |
| Abstract | Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision. |
|---|---|
| Issue Title | Special Issue: Best Papers from the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2008); Guest Editors: Takashi Washio, Einoshin Suzuki and Kai Ming Ting |
| Author | Shchekotykhin, Kostyantyn; Friedrich, Gerhard; Jannach, Dietmar |
| Author_xml | 1. Shchekotykhin, Kostyantyn (kostya@ifit.uni-klu.ac.at), University Klagenfurt; 2. Jannach, Dietmar, Technische Universität Dortmund; 3. Friedrich, Gerhard, University Klagenfurt |
| BackLink | http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=23412923$$DView record in Pascal Francis |
| CODEN | KISNCR |
| CitedBy_id | crossref_primary_10_1007_s10115_012_0478_9 crossref_primary_10_1109_ACCESS_2020_2984503 crossref_primary_10_1007_s10115_012_0535_4 crossref_primary_10_1016_j_ipm_2016_11_006 crossref_primary_10_1016_S2095_3119_12_60068_9 |
| Cites_doi | 10.1007/s10115-008-0152-4 10.1108/eb026866 10.1007/s10115-007-0094-2 10.1016/S0169-7552(98)00108-1 10.1145/1142473.1142504 10.1016/S0004-3702(00)00004-7 10.1016/S1389-1286(99)00052-3 10.2753/JEC1086-4415110201 10.1145/383952.383995 10.1109/ICWE.2008.24 10.1145/1298406.1298438 10.1109/TKDE.2003.1208999 10.1109/TKDE.2004.1264823 10.1145/324133.324140 10.1145/1242572.1242630 10.1007/s10115-007-0107-1 10.1007/3-540-48686-0_1 10.1145/775152.775178 10.1016/j.websem.2009.04.002 |
| ContentType | Journal Article |
| Copyright | Springer-Verlag London Limited 2009; 2015 INIST-CNRS; Springer-Verlag London Limited 2010 |
| DOI | 10.1007/s10115-009-0266-3 |
| Discipline | Computer Science; Applied Sciences |
| EISSN | 0219-3116 |
| EndPage | 326 |
| ExternalDocumentID | 2192393871 23412923 10_1007_s10115_009_0266_3 |
| Genre | Feature |
| ISICitedReferencesCount | 7 |
| ISICitedReferencesURI | http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000283510200006&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| ISSN | 0219-1377 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2 |
| Keywords | Information retrieval, Web crawling, Information extraction, Web mining, Data analysis, Extraction process, Redundancy, Electronic document, Data mining, Information browsing, World wide web, Automatic generation, Internet, Web site |
| Language | English |
| License | http://www.springer.com/tdm CC BY 4.0 |
| PageCount | 24 |
| PublicationDate | 2010-11-01 |
| PublicationPlace | London |
| PublicationSubtitle | An International Journal |
| PublicationTitle | Knowledge and information systems |
| PublicationTitleAbbrev | Knowl Inf Syst |
| PublicationYear | 2010 |
| Publisher | Springer-Verlag; Springer; Springer Nature B.V |
| StartPage | 303 |
| SubjectTerms | Algorithms, Analysis, Applied sciences, Automation, Computer Science, Computer science; control theory; systems, Computer systems and distributed systems. User interface, Data mining, Data Mining and Knowledge Discovery, Data processing. List processing. Character string processing, Database Management, Descriptions, Digital cameras, Exact sciences and technology, Extraction, Hierarchies, Information retrieval, Information Storage and Retrieval, Information systems, Information Systems and Communication Service, Information Systems Applications (incl.Internet), Information systems. Data bases, IT in Business, Memory organisation. Data processing, Mining, Recall, Regular Paper, Search engines, Searches, Software, Studies, URLs, Websites |
| Title | xCrawl: a high-recall crawling method for Web mining |
| URI | https://link.springer.com/article/10.1007/s10115-009-0266-3 https://www.proquest.com/docview/807418859 https://www.proquest.com/docview/861539128 |
| Volume | 25 |
| WOSCitedRecordID | wos000283510200006&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D |
| hasFullText | 1 |
| inHoldings | 1 |
| journalDatabaseRights | – providerCode: PRVPQU databaseName: ABI/INFORM Collection customDbUrl: eissn: 0219-3116 dateEnd: 20171231 omitProxy: false ssIdentifier: ssj0017611 issn: 0219-1377 databaseCode: 7WY dateStart: 20020101 isFulltext: true titleUrlDefault: https://www.proquest.com/abicomplete providerName: ProQuest – providerCode: PRVPQU databaseName: ABI/INFORM Global customDbUrl: eissn: 0219-3116 dateEnd: 20171231 omitProxy: false ssIdentifier: ssj0017611 issn: 0219-1377 databaseCode: M0C dateStart: 20020101 isFulltext: true titleUrlDefault: https://search.proquest.com/abiglobal providerName: ProQuest – providerCode: PRVPQU databaseName: Advanced Technologies & Aerospace Database customDbUrl: eissn: 0219-3116 dateEnd: 20171231 omitProxy: false ssIdentifier: ssj0017611 issn: 0219-1377 databaseCode: P5Z dateStart: 20020101 isFulltext: true titleUrlDefault: https://search.proquest.com/hightechjournals providerName: ProQuest – providerCode: PRVPQU databaseName: Computer Science Database customDbUrl: eissn: 0219-3116 dateEnd: 20171231 omitProxy: false ssIdentifier: ssj0017611 issn: 0219-1377 databaseCode: K7- dateStart: 20020101 isFulltext: true titleUrlDefault: http://search.proquest.com/compscijour providerName: ProQuest – providerCode: PRVPQU databaseName: ProQuest Central customDbUrl: eissn: 0219-3116 dateEnd: 20171231 omitProxy: false ssIdentifier: ssj0017611 issn: 0219-1377 databaseCode: BENPR dateStart: 20020101 isFulltext: true titleUrlDefault: https://www.proquest.com/central providerName: ProQuest – providerCode: PRVAVX databaseName: SpringerLink customDbUrl: eissn: 0219-3116 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0017611 issn: 0219-1377 databaseCode: RSV dateStart: 19990201 isFulltext: true titleUrlDefault: https://link.springer.com/search?facet-content-type=%22Journal%22 providerName: Springer Nature |
| link | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwpV1LaxsxEB6axyEQ4uZFNmmMDjk1iOxqX1IvITUOhVJj0jZ2clkkrQwF13b8SPLzO9I-jAPxpRfB7kq7QqPRfLMjzQdwgT5yonTOqeRa0ihQAypiJSkaV6a1TtBqSUc2kXY6vN8X3XJvzqzcVlmtiW6hzsfa_iO_sklbAs5jcT15opY0ygZXSwaNDdhCOx1bAoO091AHEdBDd4R5qJTUJtargprFyTmEQtRFBtBE0XDFLO1O5AxHaFBQW6xgzzfhUmeFbhv_2f-PsFfCT3JTzJd9-GBGB9CoqB1IqemHEL22pvJl-IVIYhMaU1wX5XBItL2JHSMF8TRBxEt6RpG_jmbiCH7ftn-1vtGSYIFqHI45ZYM8RM9aGJnYU4ODWCFajCKWKhMnSvIwlyxQUa59w4XxWZ7GMvS54DoIdJKY8Bg2R-OROQGCjolROUNvKkFEg35vhHMjZGnMBM-FNh741fhmusw-bkkwhtkyb7IVSYYiyaxIstCDz3WTSZF6Y13l5orQ6hYMzTRDFOvBWSWXrFTTWVYLxQNSP0X9skETOTLjBVaxiFigFffgspL98gXv9ud07efOYMftQnBnGj_B5ny6MOewrZ_nf2bTppvFTdj62u507_Dqe0qx_OG3sOzGj1je_bz_B36G-tw |
| linkProvider | ProQuest |
| linkToHtml | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMw1V1LT9wwEB5RQGolVB5tRUoBH-ACstjYiWNXqqqKh0ALKw4guKW245UqLbvL7lLaH9X_2LHzQIsENw5ckziJ7M8z32Qy8wFsYYwsjC0k1dJqmsSmS1VqNEXnyqy1Ar2WDmITWacjr6_V-Qz8q2th_G-VtU0MhroYWP-NfM83bYmlTNX34S31olE-uVoraJSoaLu_9xixjb-dHODybjN2dHixf0wrUQFqeZZNKOsWHKNJ5bTwlXLd1CBDShKWGZcKoyUvNItNUtiWk8q1WJGlmrekkjaOrRCO433fwFzCpfAbqp3RJmmRiSD3i15TUd_Ir06ilpV6SL1oyESgS6R8yg0uDPUYV6RbSmlMcd1H6dng9Y4WX9l8LcH7il6TH-V-WIYZ11-BxVq6glSW7AMkf_ZH-r73lWjiGzZTtPu61yPWH8SJIKWwNkFGT66cITdBRuMjXL7Iu3-C2f6g71aBYODlTMEwWhTI2DCuTxD7nGUpU7JQ1kXQqtczt1V3dS_y0csf-kJ7COQIgdxDIOcR7DRDhmVrkecu3pgCSTOCIQ1hyNIjWKtxkFdmaJw3IIiANGfRfvikkO67wR1e4hm_QpYSwW6NtYcbPPk-n5993Ca8Pb44O81PTzrtNXgX_rgI9ZtfYHYyunPrMG9_T36NRxthBxH4-dIQ_A-g71Hv |
| linkToPdf | http://cvtisr.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMw1V1bSxwxFD7oWkQQta3iaNU8tC8tYXeSuSSCFG-LoixLaalvY5LJgLDurrvrpT-t_86TuckK-uaDrzOZmZB8Oec7c5LzAXzFGDnSJhVUCaNo4OuMylAris6VGWMi9FoqF5uIOx1xcSG7M_C_OgvjtlVWNjE31OnAuH_kTVe0xRcilM2s3BXRPWr_HN5QJyDlEq2VmkaBkDP77x6jt_He6RFO9TfG2se_D09oKTBADY_jCWVZyjGylFZF7tRcFmpkS0HAYm3DSCvBU8V8HaSmZYW0LZbGoeItIYXxfRNFluN7Z2Eu5hjzNGDu4LjT_VWnMOIoF_9FHyqpK-tXpVSLc3tIxGiel0AHSfmUU1wcqjHOT1YIa0wx32fJ2twHtpff8eitwFJJvMl-sVI-woztf4LlStSClDbuMwQPhyN139slirhSzhQ9gur1iHEXcVBIIblNkOuTv1aT61xgYxX-vEnf16DRH_TtOhAMyaxOGcaREXI5jPgDXBWcxSGTIpXGetCq5jYxZd11J__RS54qRjs4JAiHxMEh4R58rx8ZFkVHXmu8PQWY-gmGBIUhf_dgs8JEUhqocVIDwgNS30XL4tJFqm8Ht9jExQIS-YsHPyrcPb3gxf5svPq5HZhH5CXnp52zTVjIt2LkBzu_QGMyurVb8MHcTa7Go-1yORG4fGsMPgJvj1xB |
| openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=xCrawl%3A+a+high-recall+crawling+method+for+Web+mining&rft.jtitle=Knowledge+and+information+systems&rft.au=Shchekotykhin%2C+Kostyantyn&rft.au=Jannach%2C+Dietmar&rft.au=Friedrich%2C+Gerhard&rft.date=2010-11-01&rft.pub=Springer+Nature+B.V&rft.issn=0219-1377&rft.eissn=0219-3116&rft.volume=25&rft.issue=2&rft.spage=303&rft_id=info:doi/10.1007%2Fs10115-009-0266-3&rft.externalDBID=HAS_PDF_LINK&rft.externalDocID=2192393871 |