Wikipedia HTML Structure Analysis for Ontology Construction

Bibliographic details
Published in:	Knowledge Organization, Vol. 45, No. 2, pp. 108-124
Main authors:	Zarrad, Rim; Doggaz, Narjes; Zagrouba, Ezzedine
Format:	Journal Article
Language:	English
Published:	Baden-Baden: Nomos Verlagsgesellschaft mbH & Co. KG, 01.01.2018
Subjects:	Classification; Data; Experiments; Extraction; French language; Information; Information retrieval; Information sources; Internet; Natural language; Ontology; Organizational structure; Semantic relations; Semantics; Taxonomy
ISSN:	0943-7444

Description
Summary:	Previously, the main problem of information extraction was to gather enough data. Today, the challenge is not to collect data but to interpret and represent them in order to deduce information. Ontologies are considered suitable solutions for organizing information. The classic methods for ontology construction from textual documents rely on natural language analysis and are generally based on statistical or linguistic approaches. However, these approaches do not consider the document structure, which provides additional knowledge. In fact, the structural organization of documents also conveys meaning. In this context, new approaches focus on document structure analysis to extract knowledge. This paper describes a methodology for ontology construction from web data, especially from Wikipedia articles. It focuses mainly on document structure in order to extract the main concepts and their relations. The proposed methods not only extract taxonomic and non-taxonomic relations but also provide the labels describing non-taxonomic relations. The extraction of non-taxonomic relations is established by analyzing the title hierarchy in each document. Pattern matching is also applied in order to extract known semantic relations. We also propose refining the extracted relations in order to keep only those that are relevant. The refinement process is performed by applying the transitive property, checking the nature of the relations, and analyzing taxonomic relations having inverted arguments. Experiments have been performed on French Wikipedia articles related to the medical field. The resulting ontology is evaluated by comparing it to gold standards.
DOI:	10.5771/0943-7444-2018-2-108