Automated thematic dictionary creation using the web based on WordNet, Spacy, and Simhash

Bibliographic details
Published in: Data and information management, Volume 9, Issue 3, p. 100088
Main authors: Toprak, Ahmet; Turan, Metin
Format: Journal Article
Language: English
Published: Elsevier Ltd, 1 September 2025
ISSN: 2543-9251
Description
Summary: A dictionary is a helpful tool for most context-based Natural Language Processing research. The words in a language dictionary establish the context coverage for a specific application area. This study proposes a novel model for generating a thematic dictionary from web resources. The model draws on several text similarity algorithms to enhance dictionary coverage and increase its internal similarity. For example, to create a financial dictionary, the algorithm was started with the general seed word “finance”. A web search was executed with this word, and the top three web pages returned by the search engine were processed. The words in the contents of these pages were ranked according to their meaning values using the term frequency-inverse document frequency (TF-IDF) metric. The selected words were then inserted into three different dictionaries, controlled separately by the WordNet, Spacy, and Simhash text similarity algorithms. All of the words added to these dictionaries were then used together for further web searches. This search-and-update process was repeated for each dictionary separately until each reached the upper limit of 250 words. Finally, the three dictionaries were merged to form the final financial dictionary, which was compared in terms of quality with a manually created financial dictionary. The internal WordNet similarity rate of the words was 29.01% in the automatic financial dictionary and 23.41% in the manual one. To measure the similarity between the two dictionaries, the words of each were merged into a full text and the two texts were compared using Simhash, yielding 72.30% similarity. Although both dictionaries contain largely similar words, the automatic dictionary had a stronger internal semantic representation.
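The ranking step described in the summary can be illustrated with a minimal Python sketch. It is not the authors' implementation: the tokenizer, the smoothed IDF formula, the choice to keep each word's best score across pages, and the names `tokenize` and `rank_by_tfidf` are all assumptions made for illustration, since the summary does not say how TF-IDF was computed over the retrieved pages.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def rank_by_tfidf(pages, top_k=5):
    """Return the top_k words ranked by their best TF-IDF score over all pages.

    Assumption: IDF is smoothed as log((1 + N) / (1 + df)) + 1 so that words
    present in every page still receive a small non-zero weight.
    """
    docs = [Counter(tokenize(page)) for page in pages]
    n_docs = len(docs)
    # Document frequency: the number of pages in which each word occurs.
    df = Counter(word for doc in docs for word in doc)

    scores = {}
    for doc in docs:
        total = sum(doc.values())
        for word, count in doc.items():
            tf = count / total
            idf = math.log((1 + n_docs) / (1 + df[word])) + 1
            scores[word] = max(scores.get(word, 0.0), tf * idf)

    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top_k]


# Three short stand-in "pages" (hypothetical text, not data from the study).
pages = ["loan loan loan the", "the stock market", "the bond yield"]
print(rank_by_tfidf(pages, top_k=3))
```

In this toy input, a word concentrated in one page ("loan") ranks above a word shared by every page ("the"), which is the behavior the metric is meant to provide when selecting candidate dictionary words.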
• Automated model for creating thematic dictionaries using web-based resources.
• Combines WordNet, Spacy, and Simhash to enhance dictionary coverage and similarity.
• Successfully created a financial dictionary with stronger semantic representation.
• Reduces time and effort in dictionary creation compared to manual methods.
• Applicable for tasks like text classification, summarization, and theme detection.
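The Simhash comparison mentioned in this record, in which the words of two dictionaries are merged into full texts and compared, can be sketched as follows. This is a generic, unweighted Simhash over word tokens, offered only as an illustration: the hash function (SHA-256), the 64-bit fingerprint size, and the names `simhash` and `simhash_similarity` are assumptions, and the record does not state which Simhash variant or similarity formula the authors used.

```python
import hashlib
import re


def simhash(text, bits=64):
    """Compute an unweighted Simhash fingerprint from the word tokens of text."""
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position of the fingerprint.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set in the fingerprint when its votes sum to a positive value.
    return sum(1 << i for i in range(bits) if weights[i] > 0)


def simhash_similarity(text_a, text_b, bits=64):
    """Return the fraction of fingerprint bits on which the two texts agree."""
    distance = bin(simhash(text_a, bits) ^ simhash(text_b, bits)).count("1")
    return 1 - distance / bits


# Hypothetical word lists standing in for two dictionaries merged into texts.
automatic = "finance bank loan credit interest market stock bond"
manual = "finance bank loan credit investment market stock dividend"
print(f"{simhash_similarity(automatic, manual):.2%}")
```

Because the fingerprint is built from per-token votes, word order does not affect the result, which suits a comparison of dictionaries that are unordered word sets.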
DOI: 10.1016/j.dim.2024.100088