MinJoin++: a fast algorithm for string similarity joins under edit distance
We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc...
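To make the problem statement concrete, the following is a minimal brute-force sketch of an edit similarity join: it reports every pair of strings whose edit (Levenshtein) distance is at most a threshold. This is only a baseline for illustration and is not the partition-based MinJoin++ algorithm described in the article; the function names and the parameter `k` are hypothetical.

```python
from itertools import combinations


def within_edit_distance(a: str, b: str, k: int) -> bool:
    """Return True if the edit (Levenshtein) distance between a and b is <= k."""
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > k:
        return False
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # match or substitute
        # Row minima never decrease, so once a row exceeds k we can stop.
        if min(cur) > k:
            return False
        prev = cur
    return prev[-1] <= k


def naive_edit_similarity_join(strings, k):
    """Return all index pairs (i, j), i < j, whose strings are within distance k."""
    return [(i, j)
            for (i, a), (j, b) in combinations(enumerate(strings), 2)
            if within_edit_distance(a, b, k)]
```

This baseline compares all pairs, so its cost grows quadratically with the number of strings; that is the scaling problem on long strings and large thresholds that the article sets out to address.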
| Published in: | The VLDB journal, Vol. 33, No. 2, pp. 281-299 |
|---|---|
| Main authors: | Karpov, Nikolai; Zhang, Haoyu; Zhang, Qin |
| Format: | Journal Article |
| Language: | English |
| Published: | Berlin/Heidelberg: Springer Berlin Heidelberg, 1 March 2024; Springer Nature B.V. |
| Subjects: | String similarity joins; Edit distance; Local hash minima |
| ISSN: | 1066-8888, 0949-877X |
| Online access: | Full text |
| Abstract | We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitude in running time and achieves perfect accuracy on most datasets that we have tested. |
|---|---|
| Author | Zhang, Haoyu; Zhang, Qin; Karpov, Nikolai |
| Author_xml | 1. Karpov, Nikolai (Indiana University); 2. Zhang, Haoyu (Meta Inc); 3. Zhang, Qin (Indiana University, ORCID 0000-0002-6851-3115, qzhangcs@indiana.edu) |
| ContentType | Journal Article |
| Copyright | The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. |
| DOI | 10.1007/s00778-023-00806-z |
| Discipline | Computer Science |
| EISSN | 0949-877X |
| EndPage | 299 |
| ExternalDocumentID | 10_1007_s00778_023_00806_z |
| GrantInformation_xml | National Science Foundation, grant CCF-1844234 (funder ID: http://dx.doi.org/10.13039/100000001) |
| ISSN | 1066-8888 |
| IsPeerReviewed | false |
| IsScholarly | true |
| Issue | 2 |
| Keywords | String similarity joins; Edit distance; Local hash minima |
| Language | English |
| ORCID | 0000-0002-6851-3115 |
| PageCount | 19 |
| PublicationDate | 2024-03-01 |
| PublicationPlace | Berlin/Heidelberg |
| PublicationPlace_xml | Berlin/Heidelberg; New York |
| PublicationSubtitle | The International Journal on Very Large Data Bases |
| PublicationTitle | The VLDB journal |
| PublicationTitleAbbrev | The VLDB Journal |
| PublicationYear | 2024 |
| Publisher | Springer Berlin Heidelberg; Springer Nature B.V. |
| StartPage | 281 |
| SubjectTerms | Accuracy; Algorithms; Bioinformatics; Computer Science; Data mining; Database Management; Datasets; Genomes; Regular Paper; Run time (computers); Similarity; Strings |
| Title | MinJoin++: a fast algorithm for string similarity joins under edit distance |
| URI | https://link.springer.com/article/10.1007/s00778-023-00806-z https://www.proquest.com/docview/3256782916 |
| Volume | 33 |