A Natural Language Processing Framework for Document Similarity in Java Environments

Bibliographic Details
Title:	A Natural Language Processing Framework for Document Similarity in Java Environments
Authors:	Ayan Hussain, Moh Zaid Khan, Rayyan Arif Hussain, Abdul Ahad, Ambreen Anees
Source:	International Journal of Innovative Science and Research Technology. :4410-4415
Publisher Information:	International Journal of Innovative Science and Research Technology, 2025.
Publication Year:	2025
Description:	Document similarity plays a pivotal role in Natural Language Processing (NLP), especially in tasks that require measuring how closely related two texts are. This paper presents a study and implementation of document similarity techniques in the Java programming language, with a focus on practical NLP approaches. The work is motivated by real-world applications such as plagiarism detection, content recommendation systems, semantic search engines, and automated document classification. The system developed in this research employs a multi-step NLP pipeline that begins with data preprocessing: standard procedures such as text normalization, tokenization, stop-word removal, and optional stemming or lemmatization. After preprocessing, documents are converted into numerical vectors using the Term Frequency–Inverse Document Frequency (TF-IDF) weighting scheme, which weights each term by its importance in a document relative to the collection as a whole. Because cosine similarity is effective at comparing text-based vectors in a high-dimensional space, it is used to evaluate the similarity between document vectors.
Document Type:	Article
Language:	English
DOI:	10.38124/ijisrt/25apr1961
Accession Number:	edsair.doi...........5bb91d0faf6f181f882ea78fa2ed2296
Database:	OpenAIRE
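
The pipeline summarized in the description (preprocessing, TF-IDF weighting, cosine similarity) can be illustrated with a minimal Java sketch. This is not the authors' implementation, which the record does not reproduce. The class name DocumentSimilaritySketch, the tiny stop-word list, and the smoothed IDF formula ln((1 + N) / (1 + df)) + 1 are all assumptions chosen for illustration, and stemming or lemmatization is omitted.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; names and formulas are illustrative assumptions.
final class DocumentSimilaritySketch {

    // Tiny illustrative stop-word list (assumption; real lists are much longer).
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "the", "is", "of", "and", "in", "on", "to");

    // Lower-case the text, split on non-letter/non-digit runs, drop stop words.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty() && !STOP_WORDS.contains(token)) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    // Build one sparse TF-IDF vector (term -> weight) per tokenized document.
    static List<Map<String, Double>> tfIdf(List<List<String>> docs) {
        // Document frequency: number of documents containing each term.
        Map<String, Integer> docFreq = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) {
                docFreq.merge(term, 1, Integer::sum);
            }
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Integer> counts = new HashMap<>();
            for (String term : doc) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vector = new HashMap<>();
            for (Map.Entry<String, Integer> e : counts.entrySet()) {
                double tf = (double) e.getValue() / doc.size();
                // Smoothed IDF (assumed variant): always positive, no division by zero.
                double idf = Math.log((1.0 + docs.size()) / (1.0 + docFreq.get(e.getKey()))) + 1.0;
                vector.put(e.getKey(), tf * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }

    // Cosine similarity of two sparse vectors; returns 0 if either vector is empty.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
                preprocess("The cat sat on the mat."),
                preprocess("A cat sat on a mat!"),
                preprocess("Stock markets fell sharply today."));
        List<Map<String, Double>> vectors = tfIdf(docs);
        System.out.println("doc0 vs doc1: " + cosine(vectors.get(0), vectors.get(1)));
        System.out.println("doc0 vs doc2: " + cosine(vectors.get(0), vectors.get(2)));
    }
}
```

In this sketch the first two sample sentences reduce to the same tokens after stop-word removal, so their similarity is at the top of the range, while the third shares no terms with them and scores zero. Sparse maps are used instead of dense arrays because most terms are absent from any single document.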
