Genetic Mining of HTML Structures for Effective Web-Document Retrieval

Bibliographic details
Published in: Applied intelligence (Dordrecht, Netherlands) Vol. 18; no. 3; pp. 243 - 256
Main authors: Kim, Sun; Zhang, Byoung-Tak
Format: Journal Article
Language: English
Published: Boston, Springer Nature B.V., 1 May 2003
ISSN: 0924-669X, 1573-7497
Description
Summary: Web-documents have a number of tags indicating the structure of texts. Text segments marked by HTML tags have specific meaning which can be utilized to improve the performance of document retrieval systems. In this paper, we present a machine learning approach to mine the structure of HTML documents for effective Web-document retrieval. A genetic algorithm is described that learns the importance factors of HTML tags which are used to re-rank the documents retrieved by standard weighting schemes. The proposed method has been evaluated on artificial text sets and a large-scale TREC document collection. Experimental evidence supports that the tag weights are well trained by the proposed algorithm in accordance with the importance factors for retrieval, and indicates that the proposed approach significantly improves the performance in retrieval accuracy. In particular, the use of the document-structure mining approach tends to move relevant documents to upper ranks, which is especially important in interactive Web-information retrieval environments.
DOI: 10.1023/A:1023293820057