Wikipedia HTML Structure Analysis for Ontology Construction

Bibliographic Details
Published in: Knowledge Organization, Vol. 45, no. 2, pp. 108-124
Main Authors: Zarrad, Rim; Doggaz, Narjes; Zagrouba, Ezzedine
Format: Journal Article
Language: English
Published: Baden-Baden: Nomos Verlagsgesellschaft mbH & Co. KG, 01.01.2018
ISSN: 0943-7444
Abstract: Previously, the main problem of information extraction was gathering enough data. Today, the challenge is not collecting data but interpreting and representing it in order to deduce information. Ontologies are considered a suitable solution for organizing information. The classic methods for ontology construction from textual documents rely on natural language analysis and are generally based on statistical or linguistic approaches. However, these approaches do not consider document structure, which provides additional knowledge: the structural organization of a document also conveys meaning. In this context, new approaches focus on document structure analysis to extract knowledge. This paper describes a methodology for ontology construction from web data, especially Wikipedia articles. It focuses mainly on document structure in order to extract the main concepts and their relations. The proposed methods extract both taxonomic and non-taxonomic relations and also provide labels describing the non-taxonomic relations. Non-taxonomic relations are extracted by analyzing the title hierarchy of each document. Pattern matching is also applied to extract known semantic relations. We also propose refining the extracted relations in order to keep only those that are relevant. The refinement applies the transitive property, checks the nature of the relations, and analyzes taxonomic relations with inverted arguments. Experiments were performed on French Wikipedia articles related to the medical field. The resulting ontology is evaluated by comparing it to gold standards.
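Illustration (not from the article): the abstract mentions analyzing the title hierarchy of each document and refining relations through transitivity and inverted arguments, but gives no implementation details. The following is a minimal sketch of one plausible reading, using only the Python standard library. The names HeadingCollector, title_pairs, and refine are hypothetical, the sample HTML is invented, and the pairs produced are only candidate relations. The sketch does not label relations or apply the pattern matching the abstract describes.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1": 1, "h2": 2, "h3": 3, "h4": 4, "h5": 5, "h6": 6}


class HeadingCollector(HTMLParser):
    """Collect (level, text) for every h1-h6 element, in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self._level = None  # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in HEADING_TAGS:
            self._level = HEADING_TAGS[tag]
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (spans, links) is collected as well.
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag in HEADING_TAGS and self._level is not None:
            text = "".join(self._buffer).strip()
            if text:
                self.headings.append((self._level, text))
            self._level = None


def title_pairs(html):
    """Return (parent_title, child_title) pairs implied by heading nesting."""
    collector = HeadingCollector()
    collector.feed(html)
    stack, pairs = [], []
    for level, text in collector.headings:
        # Headings at the same or a deeper level cannot be ancestors.
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            pairs.append((stack[-1][1], text))
        stack.append((level, text))
    return pairs


def refine(pairs):
    """Drop pairs implied by a two-step path; report inverted pairs once."""
    relations = set(pairs)
    inverted = {(a, b) for (a, b) in relations if (b, a) in relations and a < b}
    nodes = {x for pair in relations for x in pair}
    redundant = {
        (a, c)
        for (a, c) in relations
        for b in nodes
        if b not in (a, c) and (a, b) in relations and (b, c) in relations
    }
    return sorted(relations - redundant), sorted(inverted)


if __name__ == "__main__":
    # Invented sample, loosely modeled on a French medical article outline.
    sample = "<h1>Diabète</h1><h2>Symptômes</h2><h2>Traitement</h2><h3>Insuline</h3>"
    kept, inverted = refine(title_pairs(sample))
    print(kept)
    print(inverted)
```

On real Wikipedia pages, heading elements may also contain interface text such as edit links, so a production version would need to filter that text out before building pairs.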
Copyright: Nomos Verlagsgesellschaft mbH und Co KG, 2018
DOI: 10.5771/0943-7444-2018-2-108
Subjects: Classification; Data; Experiments; Extraction; French language; Information; Information retrieval; Information sources; Internet; Natural language; Ontology; Organizational structure; Semantic relations; Semantics; Taxonomy
Online Access: https://www.nomos-elibrary.de/index.php?doi=10.5771/0943-7444-2018-2-108
https://www.proquest.com/docview/2494933669