Wikipedia HTML Structure Analysis for Ontology Construction
| Published in: | Knowledge organization Vol. 45; no. 2; pp. 108 - 124 |
|---|---|
| Main Authors: | Zarrad, Rim; Doggaz, Narjes; Zagrouba, Ezzedine |
| Format: | Journal Article |
| Language: | English |
| Published: | Baden-Baden: Nomos Verlagsgesellschaft mbH & Co. KG (Nomos Verlagsgesellschaft mbH und Co KG), 01.01.2018 |
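| Subjects: | Classification; Data; Experiments; Extraction; French language; Information; Information retrieval; Information sources; Internet; Natural language; Ontology; Organizational structure; Semantic relations; Semantics; Taxonomy |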
| ISSN: | 0943-7444 |
| Abstract | Previously, the main problem of information extraction was gathering enough data. Today, the challenge is not collecting data but interpreting and representing them in order to deduce information. Ontologies are considered suitable solutions for organizing information. The classic methods for ontology construction from textual documents rely on natural language analysis and are generally based on statistical or linguistic approaches. However, these approaches do not consider document structure, which provides additional knowledge: the structural organization of documents also conveys meaning. In this context, new approaches focus on document structure analysis to extract knowledge. This paper describes a methodology for ontology construction from web data, especially Wikipedia articles. It focuses mainly on document structure in order to extract the main concepts and their relations. The proposed methods extract not only taxonomic and non-taxonomic relations but also the labels describing non-taxonomic relations. Non-taxonomic relations are extracted by analyzing the title hierarchy of each document. Pattern matching is also applied to extract known semantic relations. We also propose refining the extracted relations in order to keep only those that are relevant. The refinement process applies the transitive property, checks the nature of the relations, and analyzes taxonomic relations having inverted arguments. Experiments were performed on French Wikipedia articles related to the medical field. The ontology is evaluated by comparing it to gold standards. |
|---|---|
| Author | Zarrad, Rim; Doggaz, Narjes; Zagrouba, Ezzedine |
| ContentType | Journal Article |
| Copyright | Copyright Nomos Verlagsgesellschaft mbH und Co KG 2018 |
| DOI | 10.5771/0943-7444-2018-2-108 |
| Discipline | Library & Information Science |
| EndPage | 124 |
| ISICitedReferencesCount | 1 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 2 |
| Language | English |
| PageCount | 17 |
| PublicationDate | 2018-01-01 |
| PublicationPlace | Baden-Baden; Wuerzburg |
| PublicationTitle | Knowledge organization |
| PublicationTitleAlternate | KO |
| PublicationYear | 2018 |
| Publisher | Nomos Verlagsgesellschaft mbH & Co. KG; Nomos Verlagsgesellschaft mbH und Co KG |
| StartPage | 108 |
| SubjectTerms | Classification; Data; Experiments; Extraction; French language; Information; Information retrieval; Information sources; Internet; Natural language; Ontology; Organizational structure; Semantic relations; Semantics; Taxonomy |
| Title | Wikipedia HTML Structure Analysis for Ontology Construction |
| URI | https://www.nomos-elibrary.de/index.php?doi=10.5771/0943-7444-2018-2-108 https://www.proquest.com/docview/2494933669 |
| Volume | 45 |
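
The abstract describes deriving relations from the title hierarchy of each article and then refining them using transitivity and a check for inverted arguments. The following is only a minimal Python sketch of that general idea, not the authors' implementation. The names `HeadingCollector`, `hierarchy_pairs`, and `refine` are hypothetical, and two choices are assumptions made for illustration: that both members of an inverted pair are discarded, and that the transitive property is used to drop relations already implied by two others. Real Wikipedia pages would also need cleanup (edit links, navigation boxes) that is omitted here.

```python
from html.parser import HTMLParser


class HeadingCollector(HTMLParser):
    """Collect (level, text) pairs for h1..h6 elements in document order."""

    def __init__(self):
        super().__init__()
        self.headings = []   # list of (level, title)
        self._level = None   # level of the heading currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if len(tag) == 2 and tag[0] == "h" and tag[1] in "123456":
            self._level = int(tag[1])
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested tags (e.g. <span>) is collected as well.
        if self._level is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if self._level is not None and tag == f"h{self._level}":
            self.headings.append((self._level, "".join(self._buffer).strip()))
            self._level = None


def hierarchy_pairs(html):
    """Return (parent_title, child_title) pairs implied by heading nesting."""
    collector = HeadingCollector()
    collector.feed(html)
    pairs, stack = [], []  # stack holds the currently open ancestor headings
    for level, title in collector.headings:
        while stack and stack[-1][0] >= level:
            stack.pop()
        if stack:
            pairs.append((stack[-1][1], title))
        stack.append((level, title))
    return pairs


def refine(taxonomic):
    """Drop inverted pairs and pairs implied by transitivity (assumed rules)."""
    relations = set(taxonomic)
    # Inverted arguments: (a, b) and (b, a) contradict each other; drop both.
    relations = {(a, b) for a, b in relations if (b, a) not in relations}
    # Transitivity: (a, c) is redundant if (a, b) and (b, c) are both present.
    redundant = {
        (a, c)
        for a, b in relations
        for b2, c in relations
        if b == b2 and (a, c) in relations
    }
    return relations - redundant
```

For example, `hierarchy_pairs` applied to a page whose `h1` is followed by two `h2` sections yields one candidate pair per section, each with the `h1` title as parent. Whether such a pair is taxonomic or non-taxonomic, and how it is labeled, is decided by later steps of the methodology that this sketch does not cover.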