xCrawl: a high-recall crawling method for Web mining

Bibliographic details
Published in: Knowledge and Information Systems, Vol. 25, No. 2, pp. 303-326
Main authors: Shchekotykhin, Kostyantyn; Jannach, Dietmar; Friedrich, Gerhard
Format: Journal Article
Language: English
Published: London, Springer-Verlag, 1 November 2010
Springer
Springer Nature B.V
ISSN: 0219-1377, 0219-3116
Abstract Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its “recall”, i.e., the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
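The recall and precision measures mentioned in the abstract can be made concrete with a small sketch. The Python snippet below is illustrative only and is not taken from the paper: the function names recall and precision and the example page identifiers are hypothetical, and it assumes that the full set of relevant documents is known in advance, which in practice usually holds only for a controlled evaluation set.

```python
from typing import AbstractSet


def recall(retrieved: AbstractSet[str], relevant: AbstractSet[str]) -> float:
    """Fraction of all existing relevant documents that the crawler found."""
    if not relevant:
        raise ValueError("recall is undefined when no relevant documents exist")
    return len(retrieved & relevant) / len(relevant)


def precision(retrieved: AbstractSet[str], relevant: AbstractSet[str]) -> float:
    """Fraction of the documents reported by the crawler that are relevant."""
    if not retrieved:
        return 0.0  # assumption: an empty result set is scored as zero precision
    return len(retrieved & relevant) / len(retrieved)


if __name__ == "__main__":
    # Hypothetical evaluation data: four relevant pages exist, the crawler
    # reports three pages, two of which are actually relevant.
    relevant_pages = {"page-a", "page-b", "page-c", "page-d"}
    crawled_pages = {"page-a", "page-b", "page-x"}
    print(f"recall    = {recall(crawled_pages, relevant_pages):.2f}")     # 0.50
    print(f"precision = {precision(crawled_pages, relevant_pages):.2f}")  # 0.67
```

In this toy case the crawler reaches half of the existing relevant pages (recall 0.50), while two of its three reported pages are relevant (precision 0.67). A method that raises recall "while maintaining precision" finds more of the relevant pages without reporting a larger share of irrelevant ones.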
Issue Title: Special Issue: Best Papers from the 12th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD2008); Guest Editors: Takashi Washio, Einoshin Suzuki and Kai Ming Ting
Author Shchekotykhin, Kostyantyn
Friedrich, Gerhard
Jannach, Dietmar
Author_xml – sequence: 1
  givenname: Kostyantyn
  surname: Shchekotykhin
  fullname: Shchekotykhin, Kostyantyn
  email: kostya@ifit.uni-klu.ac.at
  organization: University Klagenfurt
– sequence: 2
  givenname: Dietmar
  surname: Jannach
  fullname: Jannach, Dietmar
  organization: Technische Universität Dortmund
– sequence: 3
  givenname: Gerhard
  surname: Friedrich
  fullname: Friedrich, Gerhard
  organization: University Klagenfurt
BackLink http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=23412923$$DView record in Pascal Francis
CODEN KISNCR
CitedBy_id crossref_primary_10_1007_s10115_012_0478_9
crossref_primary_10_1109_ACCESS_2020_2984503
crossref_primary_10_1007_s10115_012_0535_4
crossref_primary_10_1016_j_ipm_2016_11_006
crossref_primary_10_1016_S2095_3119_12_60068_9
Cites_doi 10.1007/s10115-008-0152-4
10.1108/eb026866
10.1007/s10115-007-0094-2
10.1016/S0169-7552(98)00108-1
10.1145/1142473.1142504
10.1016/S0004-3702(00)00004-7
10.1016/S1389-1286(99)00052-3
10.2753/JEC1086-4415110201
10.1145/383952.383995
10.1109/ICWE.2008.24
10.1145/1298406.1298438
10.1109/TKDE.2003.1208999
10.1109/TKDE.2004.1264823
10.1145/324133.324140
10.1145/1242572.1242630
10.1007/s10115-007-0107-1
10.1007/3-540-48686-0_1
10.1145/775152.775178
10.1016/j.websem.2009.04.002
ContentType Journal Article
Copyright Springer-Verlag London Limited 2009
2015 INIST-CNRS
Springer-Verlag London Limited 2010
DOI 10.1007/s10115-009-0266-3
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
Applied Sciences
EISSN 0219-3116
EndPage 326
ExternalDocumentID 2192393871
23412923
10_1007_s10115_009_0266_3
Genre Feature
ISICitedReferencesCount 7
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000283510200006&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 0219-1377
IsPeerReviewed true
IsScholarly true
Issue 2
Keywords Information retrieval
Web crawling
Information extraction
Web mining
Data analysis
Extraction process
Redundancy
Electronic document
Data mining
Information browsing
World wide web
Automatic generation
Internet
Web site
Language English
License http://www.springer.com/tdm
CC BY 4.0
PQID 807418859
PQPubID 43394
PageCount 24
ParticipantIDs proquest_miscellaneous_861539128
proquest_journals_807418859
pascalfrancis_primary_23412923
crossref_primary_10_1007_s10115_009_0266_3
crossref_citationtrail_10_1007_s10115_009_0266_3
springer_journals_10_1007_s10115_009_0266_3
PublicationCentury 2000
PublicationDate 2010-11-01
PublicationDateYYYYMMDD 2010-11-01
PublicationDate_xml – month: 11
  year: 2010
  text: 2010-11-01
  day: 01
PublicationDecade 2010
PublicationPlace London
PublicationPlace_xml – name: London
PublicationSubtitle An International Journal
PublicationTitle Knowledge and information systems
PublicationTitleAbbrev Knowl Inf Syst
PublicationYear 2010
Publisher Springer-Verlag
Springer
Springer Nature B.V
SSID ssj0017611
Score 1.9179066
SourceID proquest
pascalfrancis
crossref
springer
SourceType Aggregation Database
Index Database
Enrichment Source
Publisher
StartPage 303
SubjectTerms Algorithms
Analysis
Applied sciences
Automation
Computer Science
Computer science; control theory; systems
Computer systems and distributed systems. User interface
Data mining
Data Mining and Knowledge Discovery
Data processing. List processing. Character string processing
Database Management
Descriptions
Digital cameras
Exact sciences and technology
Extraction
Hierarchies
Information retrieval
Information Storage and Retrieval
Information systems
Information Systems and Communication Service
Information Systems Applications (incl.Internet)
Information systems. Data bases
IT in Business
Memory organisation. Data processing
Mining
Recall
Regular Paper
Search engines
Searches
Software
Studies
URLs
Websites
Title xCrawl: a high-recall crawling method for Web mining
URI https://link.springer.com/article/10.1007/s10115-009-0266-3
https://www.proquest.com/docview/807418859
https://www.proquest.com/docview/861539128
Volume 25