Comparative Analysis of Text Similarity Algorithms and Their Practical Applications in Computer Science
| Published in: | Elektrotehniski Vestnik Vol. 92; no. 3; pp. 151 - 156 |
|---|---|
| Main Authors: | Poljak, Josip; Crčić, Dražen; Horvat, Tomislav |
| Format: | Journal Article |
| Language: | English |
| Published: | Ljubljana: Elektrotehniski Vestnik, 01.01.2025 |
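| Subjects: | Accuracy; Algorithms; Automatic text analysis; Automation; Comparative analysis; Documents; Information retrieval; Linguistics; Machine learning; Measurement techniques; Natural language processing; Plagiarism; Semantics; Sentiment analysis; Similarity; Similarity measures |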
| ISSN: | 0013-5852, 2232-3236 |
| Abstract | In an era defined by vast volumes of digital text, the capacity to compare, interpret, and quantify textual similarity is a cornerstone of modern computational linguistics and natural language processing (NLP). Text similarity algorithms support critical applications in information retrieval, plagiarism detection, sentiment analysis, text summarization, and beyond. This paper provides a comprehensive survey and comparative analysis of established text similarity algorithms, including edit-distance-based metrics (Levenshtein and Damerau-Levenshtein), character-based measures (Jaro and Jaro-Winkler), local sequence alignment (Smith-Waterman), vector-based semantic measures (Cosine similarity), and methods reliant on subsequence statistics (N-gram similarity). Each algorithm is analyzed in terms of its underlying theoretical foundations, computational complexity, performance characteristics, and domain-specific suitability. While traditional approaches excel in correcting typographical errors or identifying subtle lexical variations, more robust methods handle semantically rich corpora, larger text bodies, and intricate linguistic phenomena. Moreover, potential avenues for improvement are explored, including hybridization of existing approaches and the integration of emerging machine learning and deep neural models. This holistic examination aims to inform the selection and development of text similarity measures for diverse real-world applications and to guide future research directions in computational linguistics. |
|---|---|
| Author | Crčić, Dražen; Horvat, Tomislav; Poljak, Josip |
| Copyright | Copyright Elektrotehniski Vestnik 2025 |
| Discipline | Engineering |
| IsPeerReviewed | true |
| IsScholarly | true |
| SubjectTerms | Accuracy; Algorithms; Automatic text analysis; Automation; Comparative analysis; Documents; Information retrieval; Linguistics; Machine learning; Measurement techniques; Natural language processing; Plagiarism; Semantics; Sentiment analysis; Similarity; Similarity measures |
| URI | https://www.proquest.com/docview/3227066514 |