Crawling the Hidden Web

Bibliographic Details
Title: Crawling the Hidden Web
Authors: Sriram Raghavan, Hector Garcia-Molina
Contributors: The Pennsylvania State University CiteSeerX Archives
Source: http://www.cise.ufl.edu/~cgrant/projects/public/morpheus/files/(32)Crawling the HiddenWeb(2001).pdf.
Publication Year: 2000
Collection: CiteSeerX
Subjects: Crawling, Hidden Web, Content extraction, HTML Forms
Description: Current-day crawlers retrieve content only from the publicly indexable Web, i.e., the set of web pages reachable purely by following hypertext links, ignoring search forms and pages that require authorization or prior registration. In particular, they ignore the tremendous amount of high-quality content “hidden” behind search forms, in large searchable electronic databases. In this paper, we provide a framework for addressing the problem of extracting content from this hidden Web. At Stanford, we have built a task-specific hidden Web crawler called the Hidden Web Exposer (HiWE). We describe the architecture of HiWE and present a number of novel techniques that went into its design and implementation. We also present results from experiments we conducted to test and validate our techniques.
Document Type: text
File Description: application/pdf
Language: English
Relation: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.517.8573; http://www.cise.ufl.edu/~cgrant/projects/public/morpheus/files/(32)Crawling the HiddenWeb(2001).pdf
Availability: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.517.8573
http://www.cise.ufl.edu/~cgrant/projects/public/morpheus/files/(32)Crawling the HiddenWeb(2001).pdf
Rights: Metadata may be used without restrictions as long as the oai identifier remains attached to it.
Accession Number: edsbas.49E1F0F7
Database: BASE