xCrawl: a high-recall crawling method for Web mining

Bibliographic Details
Published in:	Knowledge and information systems Vol. 25; no. 2; pp. 303 - 326
Main Authors:	Shchekotykhin, Kostyantyn; Jannach, Dietmar; Friedrich, Gerhard
Format:	Journal Article
Language:	English
Published:	London: Springer-Verlag, 01.11.2010 (Springer; Springer Nature B.V.)
Subjects:	Algorithms | Analysis | Applied sciences | Automation | Computer Science | Computer science; control theory; systems | Computer systems and distributed systems. User interface | Data mining | Data Mining and Knowledge Discovery | Data processing. List processing. Character string processing | Database Management | Descriptions | Digital cameras | Exact sciences and technology | Extraction | Hierarchies | Information retrieval | Information Storage and Retrieval | Information systems | Information Systems and Communication Service | Information Systems Applications (incl. Internet) | Information systems. Data bases | IT in Business | Memory organisation. Data processing | Mining | Recall | Regular Paper | Search engines | Searches | Software | Studies | URLs | Websites | Web crawling | Information extraction | Web mining | Data analysis | Extraction process | Redundancy | Electronic document | Information browsing | World wide web | Automatic generation | Internet | Web site
ISSN:	0219-1377, 0219-3116

Description
Summary:	Web mining systems exploit the redundancy of data published on the Web to automatically extract information from existing Web documents. The first step in the Information Extraction process is thus to locate as many Web pages as possible that contain relevant information within a limited period of time, a task which is commonly accomplished by applying focused crawling techniques. The performance of such a crawler can be measured by its "recall", i.e., the percentage of documents found and identified as relevant compared to the total number of existing documents. A higher recall value implies that more redundant data are available, which in turn leads to better results in the subsequent fact extraction phase of the Web mining process. In this paper, we propose xCrawl, a new focused crawling method which outperforms state-of-the-art approaches with respect to the recall values achievable within a given period of time. This method is based on a new combination of ideas and techniques used to identify and exploit the navigational structures of Web sites, such as hierarchies, lists, or maps. In addition, automatic query generation is applied to rapidly collect Web sources containing target documents. The proposed crawling technique was inspired by the requirements of a Web mining system developed to extract product and service descriptions given in tabular form and was evaluated in different application scenarios. Comparisons with existing focused crawling techniques reveal that the new crawling method leads to a significant increase in recall while maintaining precision.
DOI:	10.1007/s10115-009-0266-3
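
Note on the evaluation measures: the summary defines recall as the share of relevant documents that the crawler found and identified, and reports that precision is maintained. The sketch below illustrates how the two measures are commonly computed from sets of URLs. It is not taken from the paper; the function name `recall_and_precision`, its parameters, and the example URLs are hypothetical and serve only as illustration.

```python
def recall_and_precision(flagged, relevant):
    """Return (recall, precision) for one crawl.

    flagged  -- URLs the crawler found and identified as relevant
    relevant -- URLs of all relevant documents that exist (ground truth)
    Both arguments are hypothetical inputs chosen for this illustration.
    """
    flagged, relevant = set(flagged), set(relevant)
    hits = flagged & relevant  # relevant documents the crawler actually found
    recall = len(hits) / len(relevant) if relevant else 0.0
    precision = len(hits) / len(flagged) if flagged else 0.0
    return recall, precision


# Made-up example: four relevant pages exist, the crawler flags three pages,
# two of which are truly relevant.
relevant_pages = {"a.html", "b.html", "c.html", "d.html"}
flagged_pages = {"a.html", "b.html", "x.html"}
r, p = recall_and_precision(flagged_pages, relevant_pages)
print(f"recall={r:.2f} precision={p:.2f}")
```

In this made-up example the crawler reaches half of the relevant pages, and two of its three flagged pages are correct. In practice the full set of relevant documents on the Web is rarely known, so the ground-truth set is typically an approximation.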