Wikipedia HTML Structure Analysis for Ontology Construction


Bibliographic Details
Published in:	Knowledge Organization, Vol. 45, no. 2, pp. 108-124
Main Authors:	Zarrad, Rim; Doggaz, Narjes; Zagrouba, Ezzedine
Format:	Journal Article
Language:	English
Published:	Baden-Baden: Nomos Verlagsgesellschaft mbH & Co. KG, 01.01.2018
Subjects:	Classification; Data; Experiments; Extraction; French language; Information; Information retrieval; Information sources; Internet; Natural language; Ontology; Organizational structure; Semantic relations; Semantics; Taxonomy
ISSN:	0943-7444

Abstract	Previously, the main problem of information extraction was to gather enough data. Today, the challenge is not to collect data but to interpret and represent them in order to deduce information. Ontologies are considered suitable solutions for organizing information. The classic methods for ontology construction from textual documents rely on natural language analysis and are generally based on statistical or linguistic approaches. However, these approaches do not consider the document structure, which provides additional knowledge. In fact, the structural organization of documents also conveys meaning. In this context, new approaches focus on document structure analysis to extract knowledge. This paper describes a methodology for ontology construction from web data and especially from Wikipedia articles. It focuses mainly on document structure in order to extract the main concepts and their relations. The proposed methods not only extract taxonomic and non-taxonomic relations but also provide the labels describing non-taxonomic relations. The extraction of non-taxonomic relations is established by analyzing the title hierarchy in each document. Pattern matching is also applied in order to extract known semantic relations. We also propose refining the extracted relations in order to keep only those that are relevant. The refinement process is performed by applying the transitive property, checking the nature of the relations, and analyzing taxonomic relations having inverted arguments. Experiments have been performed on French Wikipedia articles related to the medical field. Ontology evaluation is performed by comparing it to gold standards.
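The abstract names two steps that a short sketch may help illustrate: reading candidate relations off the title hierarchy of a page, and refining them. The Python below is only an illustration under stated assumptions and is not the authors' implementation. The names `HeadingCollector`, `candidate_relations` and `refine` are hypothetical, headings are assumed to be plain `h1` to `h6` elements, and the refinement rules shown (dropping pairs that occur in both directions, then dropping pairs already implied by transitivity) are one possible reading of the abstract.

```python
from html.parser import HTMLParser

HEADING_LEVELS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None   # level of the heading currently being read
        self._buffer = []    # text fragments of that heading

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_LEVELS:
            self._level = HEADING_LEVELS[tag]
            self._buffer = []

    def handle_data(self, data):
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_LEVELS and self._level is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headings.append((self._level, text))
            self._level = None


def candidate_relations(html):
    """Pair each heading with its nearest enclosing (shallower) heading."""
    collector = HeadingCollector()
    collector.feed(html)
    stack = []       # currently open headings, shallowest first
    relations = []
    for level, text in collector.headings:
        # Close every heading that is at the same depth or deeper.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            relations.append((stack[-1][1], text))
        stack.append((level, text))
    return relations


def refine(relations):
    """Assumed refinement: drop inverted pairs, then transitively implied pairs."""
    pairs = set(relations)
    # A pair asserted in both directions is treated as unreliable here.
    pairs = {(a, b) for (a, b) in pairs if (b, a) not in pairs}
    # (a, c) is redundant when (a, b) and (b, c) are both present.
    implied = {
        (a, c)
        for (a, b) in pairs
        for (b2, c) in pairs
        if b == b2 and (a, c) in pairs
    }
    return sorted(pairs - implied)


page = "<h1>Diabetes</h1><h2>Symptoms</h2><h3>Thirst</h3><h2>Treatment</h2>"
relations = candidate_relations(page)
```

Here `relations` holds one (parent title, child title) pair per nested heading. Whether such a pair is taxonomic or non-taxonomic, and which label it carries, is left open in this sketch; the abstract attributes that decision to the paper's own methods.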
Copyright:	Copyright Nomos Verlagsgesellschaft mbH und Co KG 2018
DOI:	10.5771/0943-7444-2018-2-108
Peer Reviewed:	Yes