Data extraction and label assignment for web databases
| Title: | Data extraction and label assignment for web databases |
|---|---|
| Authors: | Wang, Jiying, Lochovsky, Fred H. |
| Publication Year: | 2003 |
| Collection: | The Hong Kong University of Science and Technology: HKUST Institutional Repository |
| Subject Terms: | HTML forms, automatic wrapper induction, data annotation, hidden web, information integration, web information extraction |
| Description: | Many tools have been developed to help users query, extract and integrate data from web pages generated dynamically from databases, i.e., from the Hidden Web. A key prerequisite for such tools is to obtain the schema of the attributes of the retrieved data. In this paper, we describe a system called DeLa, which reconstructs (part of) a "hidden" back-end web database. It does this by sending queries through HTML forms, automatically generating regular expression wrappers to extract data objects from the result pages and restoring the retrieved data into an annotated (labelled) table. The whole process needs no human involvement and proves to be fast (less than one minute for wrapper induction for each site) and accurate (over 90% correctness for data extraction and around 80% correctness for label assignment). |
| Document Type: | conference object |
| Language: | English |
| Relation: | https://doi.org/10.1145/775152.775179 |
| DOI: | 10.1145/775152.775179 |
| Availability: | http://repository.hkust.edu.hk/ir/Record/1783.1-76035 https://doi.org/10.1145/775152.775179 http://lbdiscover.ust.hk/uresolver?url_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rfr_id=info:sid/HKUST:SPI&rft.genre=article&rft.issn=&rft.volume=&rft.issue=&rft.date=2003&rft.spage=187&rft.aulast=Wang&rft.aufirst=Jiying&rft.atitle=Data%20extraction%20and%20label%20assignment%20for%20web%20databases&rft.title=Proceedings%20of%20the%2012th%20International%20Conference%20on%20World%20Wide%20Web,%20WWW%202003 http://www.scopus.com/record/display.url?eid=2-s2.0-84880476173&origin=inward |
| Accession Number: | edsbas.8BB9D1FA |
| Database: | BASE |