Trinity: On Using Trinary Trees for Unsupervised Web Data Extraction

Bibliographic details
Published in:	IEEE Transactions on Knowledge and Data Engineering, Vol. 26, No. 6, pp. 1544-1556
Main authors:	Sleiman, Hassan A., Corchuelo, Rafael
Format:	Journal Article
Language:	English
Publication details:	New York: IEEE, 1 June 2014; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:	Algorithm design and analysis; Computing Methodologies; Data mining; HTML; Information extraction; Java; Knowledge and data engineering tools and techniques; Machine learning; Particle separators; Partitioning algorithms; Pattern Recognition; Proposals; Software; Web data extraction; automatic wrapper generation; wrappers; unsupervised learning
ISSN:	1041-4347, 1558-2191

Description
Summary:	Web data extractors are used to extract data from web documents in order to feed automated processes. In this article, we propose a technique that works on two or more web documents generated by the same server-side template and learns a regular expression that models it and can later be used to extract data from similar documents. The technique builds on the hypothesis that the template introduces some shared patterns that do not provide any relevant data and can thus be ignored. We have evaluated and compared our technique to others in the literature on a large collection of web documents; our results demonstrate that our proposal performs better than the others and that input errors do not have a negative impact on its effectiveness; furthermore, its efficiency can be easily boosted by means of a couple of parameters, without sacrificing its effectiveness.
DOI:	10.1109/TKDE.2013.161
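
Illustrative note: the summary describes learning a regular expression from documents that share a template, treating the shared patterns as template text and the differing parts as data. The sketch below is a much-simplified illustration of that idea only; it is not the authors' trinary-tree algorithm. It swaps in Python's `difflib.SequenceMatcher` to find shared text between two documents, and the function name `shared_pattern_regex`, the `min_len` threshold, and the sample pages are all hypothetical.

```python
import re
from difflib import SequenceMatcher


def shared_pattern_regex(doc_a: str, doc_b: str, min_len: int = 3) -> str:
    """Build a regex from two documents assumed to share a template.

    Text shared by both documents becomes a literal; every stretch that
    differs becomes a lazy capture group. Hypothetical helper, not taken
    from the article.
    """
    matcher = SequenceMatcher(None, doc_a, doc_b, autojunk=False)
    parts = []
    pos_a = pos_b = 0
    for block in matcher.get_matching_blocks():
        # Ignore short accidental matches; this also skips the final
        # zero-length sentinel block that difflib always returns.
        if block.size < min_len:
            continue
        # Unmatched text before this shared block is treated as data.
        if block.a > pos_a or block.b > pos_b:
            parts.append("(.*?)")
        parts.append(re.escape(doc_a[block.a:block.a + block.size]))
        pos_a = block.a + block.size
        pos_b = block.b + block.size
    # Unmatched text after the last shared block is also data.
    if pos_a < len(doc_a) or pos_b < len(doc_b):
        parts.append("(.*?)")
    return "^" + "".join(parts) + "$"


# Two made-up pages produced by the same made-up template.
page_a = "<li><b>Title:</b> Dune</li>"
page_b = "<li><b>Title:</b> Emma</li>"
template = shared_pattern_regex(page_a, page_b)

# Apply the learned expression to a third page of the same shape.
match = re.match(template, "<li><b>Title:</b> Ulysses</li>")
print(match.group(1))  # prints "Ulysses"
```

Because `.` does not match newlines by default, this sketch only suits single-line inputs; multi-line pages would need `re.DOTALL`, and real pages would likely need token-level rather than character-level comparison.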