Extracting logical structures from HTML tables

Bibliographic details
Published in:	Computer Standards and Interfaces, Volume 30, Issue 5, pp. 296-308
Main authors:	Kim, Yeon-Seok; Lee, Kyong-Ho
Format:	Journal Article
Language:	English
Publication details:	Elsevier B.V., 1 July 2008
Subjects:	Attribute-value relations; HTML table; Information extraction; Structure analysis; XML
ISSN:	0920-5489, 1872-7018

Description
Summary:	While HTML is mainly designed for the visual rendering of Web documents, XML is widely accepted as a standard format to process and manage information. In particular, it can embed the information of logical structures. However, in order to utilize XML, the logical structures of HTML tables should first be extracted and transformed into XML representations. This paper presents an efficient method for the process, which consists of two phases: area segmentation and structure analysis. The area segmentation cleans up tables and segments them into attribute and value areas by checking visual and semantic coherency. The hierarchical structure between attribute and value areas is then analyzed and transformed into an XML representation using a proposed table model. Experimental results with 1180 HTML tables show that the proposed method performs better than conventional methods, resulting in an average accuracy of 86.7%.
DOI:	10.1016/j.csi.2007.08.006
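
Illustration:	The summary describes a two-phase process (area segmentation, then structure analysis) that ends in an XML representation. The sketch below is not the authors' method or table model; it only illustrates the general idea under strong simplifying assumptions: empty rows are dropped as cleanup, the first remaining row is taken to be the attribute area, and every later row is taken to be the value area. The names `TableRows` and `table_to_xml`, and the `record`/`field` element names, are hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET


class TableRows(HTMLParser):
    """Collect the text of every <th>/<td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []      # one list of cell texts per row
        self._cell = None   # text fragments of the open cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("th", "td") and self.rows:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("th", "td") and self._cell is not None:
            # Normalise whitespace inside the cell before storing it.
            self.rows[-1].append(" ".join("".join(self._cell).split()))
            self._cell = None


def table_to_xml(html):
    """Map a simple HTML table to XML attribute-value records."""
    parser = TableRows()
    parser.feed(html)

    # Cleanup (assumption): discard rows whose cells are all empty.
    rows = [row for row in parser.rows if any(row)]
    if not rows:
        return ET.tostring(ET.Element("table"), encoding="unicode")

    # Segmentation (assumption): first row = attribute area, rest = value area.
    attributes, values = rows[0], rows[1:]

    # Structure (assumption): pair each value with the attribute in its column.
    root = ET.Element("table")
    for row in values:
        record = ET.SubElement(root, "record")
        for name, value in zip(attributes, row):
            field = ET.SubElement(record, "field", name=name)
            field.text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    sample = (
        "<table>"
        "<tr><th>Name</th><th>Price</th></tr>"
        "<tr><td>Pen</td><td>2</td></tr>"
        "</table>"
    )
    print(table_to_xml(sample))
```

Attribute names are stored in a `name` XML attribute rather than used as element names, so header text containing spaces or punctuation still yields well-formed XML. Real tables with spanning cells, nested headers, or row-wise attributes need the kind of coherency checks and hierarchical analysis the summary refers to, which this sketch does not attempt.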