A Method of Web Information Automatic Extraction Based on XML
| Published in: | Applied Mechanics and Materials, Vol. 20-23, pp. 178-183 |
|---|---|
| Main authors: | , , , |
| Format: | Journal Article |
| Language: | English |
| Published: | Zurich: Trans Tech Publications Ltd, 01.01.2010 |
| Subjects: | |
| ISBN: | 0878492879, 9780878492879 |
| ISSN: | 1660-9336, 1662-7482 |
| Summary: | As the internet grows faster and the amount of data it contains increases, users find it more and more difficult to obtain useful information from the Web. How to extract accurate information from the Web efficiently has become an urgent problem, and Web information extraction technology has emerged to solve it. The XML-based method of automatic Web information extraction standardizes the HTML document using a data translation algorithm, forms an extraction rule base by learning the XPath expressions of samples, and uses that rule base to automatically extract pages of the same kind. The results show that this approach should lead to higher recall and precision ratios, and that the result should be self-describing, which makes it convenient for building data extraction systems for each domain. |
| Bibliography: | Selected, peer-reviewed papers from the 2010 International Conference on Information Technology for Manufacturing Systems (ITMS 2010), Macao, China, Jan. 30-31, 2010 |
| DOI: | 10.4028/www.scientific.net/AMM.20-23.178 |
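
The rule-learning idea in the summary (learn an XPath expression from a sample, store it in a rule base, apply it to pages of the same kind) can be illustrated with a minimal sketch. It is not the paper's implementation. It assumes the pages have already been standardized into well-formed XML, and the function names `learn_xpath` and `extract` and the two sample pages are hypothetical. It uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath.

```python
import xml.etree.ElementTree as ET


def learn_xpath(root, sample_text):
    """Return a positional path to the first element whose text equals sample_text."""
    def walk(node, path):
        if (node.text or "").strip() == sample_text:
            return path
        counts = {}  # per-tag sibling counter, used for 1-based positions
        for child in node:
            counts[child.tag] = counts.get(child.tag, 0) + 1
            found = walk(child, f"{path}/{child.tag}[{counts[child.tag]}]")
            if found:
                return found
        return None

    return walk(root, ".")


def extract(root, rules):
    """Apply every rule in the rule base to a page; return field -> text."""
    record = {}
    for field, path in rules.items():
        node = root.find(path)
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


# Hypothetical pages, assumed to be already normalized to well-formed XML.
sample_page = ET.fromstring(
    "<html><body><div><h1>Blue Kettle</h1><span>19.90</span></div></body></html>"
)
new_page = ET.fromstring(
    "<html><body><div><h1>Red Toaster</h1><span>34.50</span></div></body></html>"
)

# Rule base: field name -> path learned from a labelled sample value.
rules = {
    "title": learn_xpath(sample_page, "Blue Kettle"),
    "price": learn_xpath(sample_page, "19.90"),
}
result = extract(new_page, rules)
print(result)
```

Keying each extracted value by its field name is one simple way to make the output self-describing. Positional paths like these are brittle when the page layout changes, so a real rule base would likely need more robust expressions.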

