Comparative Analysis of Text Similarity Algorithms and Their Practical Applications in Computer Science

Bibliographic Details
Published in: Elektrotehniski Vestnik, Vol. 92, no. 3, pp. 151-156
Main Authors: Poljak, Josip, Crčić, Dražen, Horvat, Tomislav
Format: Journal Article
Language: English
Published: Ljubljana Elektrotehniski Vestnik 01.01.2025
ISSN: 0013-5852, 2232-3236
Abstract: In an era defined by vast volumes of digital text, the capacity to compare, interpret, and quantify textual similarity is a cornerstone of modern computational linguistics and natural language processing (NLP). Text similarity algorithms support critical applications in information retrieval, plagiarism detection, sentiment analysis, text summarization, and beyond. This paper provides a comprehensive survey and comparative analysis of established text similarity algorithms, including edit-distance-based metrics (Levenshtein and Damerau-Levenshtein), character-based measures (Jaro and Jaro-Winkler), local sequence alignment (Smith-Waterman), vector-based semantic measures (Cosine similarity), and methods reliant on subsequence statistics (N-gram similarity). Each algorithm is analyzed in terms of its underlying theoretical foundations, computational complexity, performance characteristics, and domain-specific suitability. While traditional approaches excel in correcting typographical errors or identifying subtle lexical variations, more robust methods handle semantically rich corpora, larger text bodies, and intricate linguistic phenomena. Moreover, potential avenues for improvement are explored, including hybridization of existing approaches and the integration of emerging machine learning and deep neural models. This holistic examination aims to inform the selection and development of text similarity measures for diverse real-world applications and to guide future research directions in computational linguistics.
Copyright Copyright Elektrotehniski Vestnik 2025
Peer Reviewed: Yes
Scholarly: Yes
Subject Terms: Accuracy; Algorithms; Automatic text analysis; Automation; Comparative analysis; Documents; Information retrieval; Linguistics; Machine learning; Measurement techniques; Natural language processing; Plagiarism; Semantics; Sentiment analysis; Similarity; Similarity measures