Saved in:
Bibliographic Details
Title: [Untitled]
Contributors: The Pennsylvania State University CiteSeerX Archives
Source: http://www.ijrte.org/attachments/File/v2i3/C0696072313.pdf.
Collection: CiteSeerX
Subject Terms: Index Terms—About Web Data Extraction, Document Object Model (DOM, Improved Tree Matching algorithm
Description: — Nowadays, Web pages provide a large amount of structured data, which is required by many advanced applications. This data can be searched through their Web query interfaces. The retrieved information is also called ‘deep or hidden data’. The deep data is enwrapped in Web pages in the form of data records. These special Web pages are generated dynamically and presented to users in the form of HTML documents along with other content. These webpages can be a virtual gold mine of information for business, if mined effectively. Web Data Extraction systems or web wrappers are software applications for the purpose of extracting information from Web sources like Web pages. A Web Data Extraction system usually interacts with a Web source and extracts data stored in it. The extracted data is converted into the most convenient structured format and stored for further usage. This paper deals with the development of such a wrapper, which takes search engine result pages as input and converts them into structured format. Secondly, this paper proposes a new algorithm called Improved Tree Matching algorithm, which in turn, is based on the efficient Simple Tree Matching (STM) algorithm. Towards the end of this work, there is given a comparison with existing works. Experimental results show that this approach can extract web data with lower complexity compared to other existing approaches.
Document Type: text
File Description: application/pdf
Language: English
Relation: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.682.9853; http://www.ijrte.org/attachments/File/v2i3/C0696072313.pdf
Availability: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.682.9853
http://www.ijrte.org/attachments/File/v2i3/C0696072313.pdf
Rights: Metadata may be used without restrictions as long as the oai identifier remains attached to it.
Accession Number: edsbas.8069B5C0
Database: BASE
Description
Abstract:— Nowadays, Web pages provide a large amount of structured data, which is required by many advanced applications. This data can be searched through their Web query interfaces. The retrieved information is also called ‘deep or hidden data’. The deep data is enwrapped in Web pages in the form of data records. These special Web pages are generated dynamically and presented to users in the form of HTML documents along with other content. These webpages can be a virtual gold mine of information for business, if mined effectively. Web Data Extraction systems or web wrappers are software applications for the purpose of extracting information from Web sources like Web pages. A Web Data Extraction system usually interacts with a Web source and extracts data stored in it. The extracted data is converted into the most convenient structured format and stored for further usage. This paper deals with the development of such a wrapper, which takes search engine result pages as input and converts them into structured format. Secondly, this paper proposes a new algorithm called Improved Tree Matching algorithm, which in turn, is based on the efficient Simple Tree Matching (STM) algorithm. Towards the end of this work, there is given a comparison with existing works. Experimental results show that this approach can extract web data with lower complexity compared to other existing approaches.