Detailed bibliography
Title: [Untitled]
Contributors: The Pennsylvania State University CiteSeerX Archives
Source: http://www.ijrte.org/attachments/File/v2i3/C0696072313.pdf
Collection: CiteSeerX
Subjects: Web Data Extraction, Document Object Model (DOM), Improved Tree Matching algorithm
Description: Nowadays, Web pages provide a large amount of structured data, which is required by many advanced applications. This data can be searched through Web query interfaces. The retrieved information is also called ‘deep’ or ‘hidden’ data. Deep data is wrapped in Web pages in the form of data records. These special Web pages are generated dynamically and presented to users as HTML documents along with other content. They can be a virtual gold mine of information for business if mined effectively. Web Data Extraction systems, or web wrappers, are software applications that extract information from Web sources such as Web pages. A Web Data Extraction system usually interacts with a Web source and extracts the data stored in it. The extracted data is converted into the most convenient structured format and stored for further use. This paper deals with the development of such a wrapper, which takes search engine result pages as input and converts them into a structured format. It also proposes a new algorithm, the Improved Tree Matching algorithm, which is based on the efficient Simple Tree Matching (STM) algorithm. The paper closes with a comparison with existing work. Experimental results show that this approach can extract web data with lower complexity than other existing approaches.
Document type: text
File description: application/pdf
Language: English
Relation: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.682.9853; http://www.ijrte.org/attachments/File/v2i3/C0696072313.pdf
Availability: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.682.9853
http://www.ijrte.org/attachments/File/v2i3/C0696072313.pdf
Rights: Metadata may be used without restrictions as long as the oai identifier remains attached to it.
Accession number: edsbas.8069B5C0
Database: BASE