Automated thematic dictionary creation using the web based on WordNet, Spacy, and Simhash

Bibliographic details
Published in: Data and Information Management, Volume 9, Issue 3, p. 100088
Main authors: Toprak, Ahmet; Turan, Metin
Format: Journal Article
Language: English
Published: Elsevier Ltd, 1 September 2025
ISSN: 2543-9251
Online access: Get full text
Abstract: A dictionary is a helpful tool for most context-based Natural Language Processing research. The words in a language dictionary establish the context coverage for a specific application area. This study proposes a novel model that generates a thematic dictionary from web resources. The model draws on different text similarity algorithms to enhance dictionary coverage and increase internal similarity. For example, to create a financial dictionary, the algorithm was started with the general seed word “finance”. A web search was executed with this word, and the top three web pages returned by the search engine were processed. The words in these pages were ranked by their meaning values using the term frequency-inverse document frequency (TF-IDF) metric. The selected words were then inserted into three different dictionaries, controlled separately by the WordNet, spaCy, and Simhash text similarity algorithms. All of the words added to these dictionaries were then used together for further web searches. This search-and-update process was repeated for each dictionary separately until it reached the upper limit on the word count (set to 250 words). Finally, the three dictionaries were merged to form the final financial dictionary. This dictionary was compared with a manually created financial dictionary in terms of quality. The internal WordNet similarity rate of the words was 29.01% in the automatic financial dictionary and 23.41% in the manual one. To measure the similarity between the two dictionaries, the words of each were merged into a full text and the two texts were compared using Simhash, which yielded 72.30% similarity. Although both dictionaries contain largely similar words, the automatic dictionary had a stronger internal semantic representation.
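The search-and-update loop described in the abstract can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the function names, the smoothed TF-IDF formula, the tokenisation, and the `search` and `is_related` callbacks are all assumptions introduced here. In the paper, the role of `is_related` is played by WordNet, spaCy, or Simhash, one per dictionary, and `search` by a real web search engine.

```python
import math
import re
from collections import Counter
from typing import Callable


def tokenize(text: str) -> list[str]:
    """Lowercase the text and keep alphabetic tokens only (an assumed tokenisation)."""
    return re.findall(r"[a-z]+", text.lower())


def rank_by_tfidf(pages: list[str]) -> list[str]:
    """Return the distinct words of `pages`, highest TF-IDF score first."""
    docs = [Counter(tokenize(page)) for page in pages]
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in doc)
    scores: dict[str, float] = {}
    for doc in docs:
        total = sum(doc.values())
        for word, count in doc.items():
            # Smoothed IDF; the exact variant used in the paper is not stated.
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            scores[word] = max(scores.get(word, 0.0), (count / total) * idf)
    # Ties are broken alphabetically so the ranking is deterministic.
    return sorted(scores, key=lambda w: (-scores[w], w))


def build_dictionary(
    seed: str,
    search: Callable[[str], list[str]],      # hypothetical: query -> page texts
    is_related: Callable[[str, str], bool],  # hypothetical similarity gate
    max_words: int = 250,
    top_pages: int = 3,
) -> list[str]:
    """Grow one thematic dictionary from a seed word until `max_words` is reached."""
    dictionary = [seed]
    frontier = [seed]  # words added in the previous round, searched together
    while frontier and len(dictionary) < max_words:
        pages = search(" ".join(frontier))[:top_pages]
        frontier = []
        for word in rank_by_tfidf(pages):
            if len(dictionary) >= max_words:
                break
            if word not in dictionary and is_related(seed, word):
                dictionary.append(word)
                frontier.append(word)
    return dictionary


def merge_dictionaries(*dictionaries: list[str]) -> list[str]:
    """Merge several dictionaries into one, dropping duplicates and keeping order."""
    return list(dict.fromkeys(word for d in dictionaries for word in d))
```

Running `build_dictionary` three times with three different `is_related` gates and passing the results to `merge_dictionaries` mirrors the final merge step. The loop also stops early if a round adds no new words, a safeguard assumed here rather than taken from the abstract.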
• Automated model for creating thematic dictionaries using web-based resources.
• Combines WordNet, Spacy, and Simhash to enhance dictionary coverage and similarity.
• Successfully created a financial dictionary with stronger semantic representation.
• Reduces time and effort in dictionary creation compared to manual methods.
• Applicable for tasks like text classification, summarization, and theme detection.
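As a rough illustration of the Simhash comparison mentioned in this record, the sketch below computes a 64-bit Simhash fingerprint for two texts and converts their Hamming distance into a similarity score. It is a minimal sketch under stated assumptions: the hash function (MD5), the fingerprint width, the token weighting by raw frequency, and the names `simhash` and `simhash_similarity` are all chosen here for illustration and are not taken from the paper.

```python
import hashlib
import re
from collections import Counter


def simhash(text: str, bits: int = 64) -> int:
    """Return a `bits`-wide Simhash fingerprint (bits: multiple of 8, at most 128)."""
    counts = Counter(re.findall(r"[a-z]+", text.lower()))
    vector = [0] * bits
    for token, weight in counts.items():
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            # Each token votes +weight or -weight on every bit position.
            vector[i] += weight if (token_hash >> i) & 1 else -weight
    return sum(1 << i for i in range(bits) if vector[i] > 0)


def simhash_similarity(a: str, b: str, bits: int = 64) -> float:
    """Similarity in [0, 1]: 1 minus the normalised Hamming distance."""
    distance = bin(simhash(a, bits) ^ simhash(b, bits)).count("1")
    return 1.0 - distance / bits
```

To compare two dictionaries in the way the abstract describes, each word list would first be joined into a single text, for example `simhash_similarity(" ".join(auto_words), " ".join(manual_words))`, where `auto_words` and `manual_words` are hypothetical lists of dictionary entries.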
Article number: 100088
Author details:
  Toprak, Ahmet (ORCID: 0000-0001-7046-8512; email: ce.ahmet.toprak@gmail.com)
  Turan, Metin (ORCID: 0000-0002-1941-6693; email: mturan@ticaret.edu.tr)
Copyright: 2024 The Authors
DOI: 10.1016/j.dim.2024.100088
Open access: yes
Peer reviewed: yes
Keywords: Automatic dictionary creation; Big data; Financial dictionary; Text similarity algorithms; Natural Language Processing
License: This is an open access article under the CC BY license.