Automated thematic dictionary creation using the web based on WordNet, Spacy, and Simhash

A dictionary is a helpful tool for most context-based Natural Language Processing research. The words in a language dictionary establish the context coverage for a specific application area. In this study, a novel model is proposed to generate a thematic dictionary from web resources. The m...

Bibliographic details
Published in:	Data and Information Management, Vol. 9, No. 3, p. 100088
Main authors:	Toprak, Ahmet; Turan, Metin
Format:	Journal Article
Language:	English
Published:	Elsevier Ltd, 1 September 2025
Subjects:	Automatic dictionary creation; Big data; Financial dictionary; Natural Language Processing; Text similarity algorithms
ISSN:	2543-9251

Abstract	A dictionary is a helpful tool for most context-based Natural Language Processing research. The words in a language dictionary establish the context coverage for a specific application area. In this study, a novel model is proposed to generate a thematic dictionary from web resources. The model draws on different text similarity algorithms to enhance dictionary coverage and increase its internal similarity. For example, to create a financial dictionary, the algorithm was started with the general seed word “finance”. A web search was executed with this word, and the top three web pages returned by the search engine were processed. The words in the contents of these web pages were ranked by their meaning values using the term frequency-inverse document frequency (TF-IDF) metric. The selected words were then inserted into three different dictionaries, controlled separately by the WordNet, Spacy, and Simhash text similarity algorithms. All the words added to these dictionaries were then used together in further web searches. This process (search and dictionary update) was repeated for each dictionary separately until each reached the upper word limit (set to 250 words). Finally, the three dictionaries were merged to form the final financial dictionary. This financial dictionary was compared with a manually created financial dictionary in terms of quality. The internal WordNet similarity rate of the words in the automatic financial dictionary was 29.01%, while it was 23.41% in the manual financial dictionary. To measure the similarity between the two dictionaries, the words of the automatic and manual dictionaries were each merged into a full text and the two texts were compared using Simhash, which yielded 72.30% similarity. Although both dictionaries contain largely similar words, the automatic dictionary had a stronger internal semantic representation.
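The TF-IDF ranking step mentioned in the abstract can be illustrated with a short sketch. The record does not give the exact weighting used by the authors, so this assumes a smoothed inverse document frequency and a simple alphabetic tokenizer. The function names `tokenize` and `rank_terms` and the sample page texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def rank_terms(pages, top_k=5):
    """Rank words across pages by their best TF-IDF score.

    Assumption: smoothed IDF, log((1 + N) / (1 + df)) + 1, so that words
    present on every page still receive a non-zero weight.
    """
    docs = [Counter(tokenize(p)) for p in pages]
    n_docs = len(docs)
    # Document frequency: number of pages containing each word.
    df = Counter(word for doc in docs for word in doc)
    scores = {}
    for doc in docs:
        total = sum(doc.values())
        for word, count in doc.items():
            tf = count / total
            idf = math.log((1 + n_docs) / (1 + df[word])) + 1
            scores[word] = max(scores.get(word, 0.0), tf * idf)
    # Sort by score (descending), then alphabetically for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_k]


# Hypothetical page contents standing in for the top three search results.
pages = [
    "finance bank loan bank interest",
    "finance stock market stock stock",
    "finance budget tax",
]
top_terms = rank_terms(pages, top_k=3)
```

In this toy input the seed word "finance" appears on every page, so it scores lower than page-specific words such as "stock" or "bank". A real pipeline would presumably also filter stop words before ranking, which this sketch leaves out.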
• Automated model for creating thematic dictionaries using web-based resources.
• Combines WordNet, Spacy, and Simhash to enhance dictionary coverage and similarity.
• Successfully created a financial dictionary with stronger semantic representation.
• Reduces time and effort in dictionary creation compared to manual methods.
• Applicable for tasks like text classification, summarization, and theme detection.
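The Simhash comparison of the two dictionaries described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions: a 64-bit fingerprint built from MD5 word hashes, word counts as feature weights, and similarity taken as the share of matching fingerprint bits. The record does not say how the authors derive their percentage, and the word lists below are hypothetical.

```python
import hashlib
import re
from collections import Counter


def simhash(text, bits=64):
    """Compute a Simhash fingerprint from word-frequency features."""
    weights = Counter(re.findall(r"[a-z]+", text.lower()))
    vector = [0] * bits
    for word, weight in weights.items():
        # Stable per-word hash, truncated to the requested number of bits.
        digest = hashlib.md5(word.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & ((1 << bits) - 1)
        for i in range(bits):
            vector[i] += weight if (h >> i) & 1 else -weight
    # A bit is set in the fingerprint when its weighted vote is positive.
    return sum(1 << i for i in range(bits) if vector[i] > 0)


def simhash_similarity(text_a, text_b, bits=64):
    """Share of fingerprint bits on which the two texts agree (0.0 to 1.0)."""
    distance = bin(simhash(text_a, bits) ^ simhash(text_b, bits)).count("1")
    return 1 - distance / bits


# Hypothetical word lists, each merged into one text before comparison.
automatic = " ".join(["bank", "loan", "stock", "bond", "interest"])
manual = " ".join(["bank", "loan", "stock", "equity", "dividend"])
score = simhash_similarity(automatic, manual)
```

Because the fingerprint is built from word counts, the order in which dictionary words are merged into the text does not affect the result, which suits a comparison of unordered word lists.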
ArticleNumber	100088
Author	Turan, Metin Toprak, Ahmet
Author_xml	– sequence: 1 givenname: Ahmet orcidid: 0000-0001-7046-8512 surname: Toprak fullname: Toprak, Ahmet email: ce.ahmet.toprak@gmail.com – sequence: 2 givenname: Metin orcidid: 0000-0002-1941-6693 surname: Turan fullname: Turan, Metin email: mturan@ticaret.edu.tr
CitedBy_id	crossref_primary_10_3390_covid5070099
Cites_doi	10.1016/j.sbspro.2015.02.373 10.1145/509907.509965 10.58729/1941-6679.1100 10.1111/j.1551-6709.2009.01068.x 10.21437/Eurospeech.1999-580 10.1016/j.amper.2017.11.001 10.1016/j.specom.2019.04.007 10.1037/0033-295X.104.2.211 10.1016/j.cmpbup.2021.100024 10.1016/j.eswa.2020.114547 10.1016/j.jbi.2015.06.016 10.18653/v1/N18-4018 10.3390/info11120578 10.1109/ACCESS.2023.3295776 10.5007/2175-7968.2016v36nesp1p62 10.1109/JCSSE.2012.6261916 10.1016/j.jbi.2013.07.011 10.15398/jlm.v2i1.78
ContentType	Journal Article
Copyright	2024 The Authors
Copyright_xml	– notice: 2024 The Authors
DBID	6I. AAFTH AAYXX CITATION
DOI	10.1016/j.dim.2024.100088
DatabaseName	ScienceDirect Open Access Titles Elsevier:ScienceDirect:Open Access CrossRef
DatabaseTitle	CrossRef
DatabaseTitleList
DeliveryMethod	fulltext_linktorsrc
EISSN	2543-9251
ExternalDocumentID	10_1016_j_dim_2024_100088 S254392512400024X
GroupedDBID	0R~ 6I. 8FE 8FG AAFTH AALRI AAXUO AAYWO ABUWG ACVFH ADBLJ ADCNI ADVLN AEUPX AFKRA AFPUW AIGII AIKXB AITUG AKBMS AKYEP ALMA_UNASSIGNED_HOLDINGS ALSLI AMRAJ ARAPS AZQEC BENPR BGLVJ BPHCQ CCPQU CNYFK DWQXO EBS EJD FDB GNUQQ GROUPED_DOAJ HCIFZ K6V K7- M1O M41 M~E OK1 P62 PHGZM PHGZT PIMPY PQGLB PQQKQ PROAC PRQQA PUEGO ROL SLJYH AAYXX AFFHD CITATION
ID	FETCH-LOGICAL-c340t-bf058c370753c67c06941393732e35d21bb7017afc160ab1c2df71d47d0a3dc3
ISICitedReferencesCount	2
ISICitedReferencesURI	http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=001591583800003&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN	2543-9251
IngestDate	Sat Nov 29 07:35:19 EST 2025 Tue Nov 18 21:26:14 EST 2025 Sat Sep 27 17:13:33 EDT 2025
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
Issue	3
Keywords	Automatic dictionary creation Big data Financial dictionary Text similarity algorithms Natural Language Processing
Language	English
License	This is an open access article under the CC BY license.
LinkModel	OpenURL
MergedId	FETCHMERGED-LOGICAL-c340t-bf058c370753c67c06941393732e35d21bb7017afc160ab1c2df71d47d0a3dc3
ORCID	0000-0002-1941-6693 0000-0001-7046-8512
OpenAccessLink	http://dx.doi.org/10.1016/j.dim.2024.100088
ParticipantIDs	crossref_primary_10_1016_j_dim_2024_100088 crossref_citationtrail_10_1016_j_dim_2024_100088 elsevier_sciencedirect_doi_10_1016_j_dim_2024_100088
PublicationCentury	2000
PublicationDate	September 2025 2025-09-00
PublicationDateYYYYMMDD	2025-09-01
PublicationDate_xml	– month: 09 year: 2025 text: September 2025
PublicationDecade	2020
PublicationTitle	Data and information management
PublicationYear	2025
Publisher	Elsevier Ltd
Publisher_xml	– name: Elsevier Ltd
SSID	ssj0002090391
Score	2.316064
SourceID	crossref elsevier
SourceType	Enrichment Source Index Database Publisher
StartPage	100088
SubjectTerms	Automatic dictionary creation Big data Financial dictionary Natural Language Processing Text similarity algorithms
Title	Automated thematic dictionary creation using the web based on WordNet, Spacy, and Simhash
URI	https://dx.doi.org/10.1016/j.dim.2024.100088
Volume	9
WOSCitedRecordID	wos001591583800003&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText	1
inHoldings	1
isFullTextHit
isPrint
journalDatabaseRights	– providerCode: PRVAON databaseName: DOAJ Directory of Open Access Journals customDbUrl: eissn: 2543-9251 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0002090391 issn: 2543-9251 databaseCode: DOA dateStart: 20170101 isFulltext: true titleUrlDefault: https://www.doaj.org/ providerName: Directory of Open Access Journals – providerCode: PRVHPJ databaseName: ROAD: Directory of Open Access Scholarly Resources customDbUrl: eissn: 2543-9251 dateEnd: 99991231 omitProxy: false ssIdentifier: ssj0002090391 issn: 2543-9251 databaseCode: M~E dateStart: 20170101 isFulltext: true titleUrlDefault: https://road.issn.org providerName: ISSN International Centre
linkProvider	ISSN International Centre