MinJoin++: a fast algorithm for string similarity joins under edit distance

We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc...

Bibliographic details
Published in: The VLDB Journal, vol. 33, no. 2, pp. 281-299
Main authors: Karpov, Nikolai; Zhang, Haoyu; Zhang, Qin
Format: Journal Article
Language: English
Published: Berlin/Heidelberg: Springer Berlin Heidelberg, 1 March 2024
Springer Nature B.V
ISSN: 1066-8888, 0949-877X
Online access: Full text
Abstract We study the problem of computing similarity joins under edit distance on a set of strings. Edit similarity joins is a fundamental problem in databases, data mining and bioinformatics. It finds many applications in data cleaning and integration, collaborative filtering, genome sequence assembly, etc. This problem has attracted a lot of attention in the past two decades. However, all previous algorithms either cannot scale to long strings and large similarity thresholds, or suffer from imperfect accuracy. In this paper, we propose a new algorithm for edit similarity joins using a novel string partition-based approach. We show that, theoretically, our algorithm finds all similar pairs with high probability and runs in linear time (plus a data-dependent verification step). The algorithm can also be easily parallelized. Experiments on real-world datasets show that our algorithm outperforms the state-of-the-art algorithms for edit similarity joins by orders of magnitude in running time and achieves perfect accuracy on most datasets that we have tested.
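For readers unfamiliar with the problem the abstract states, the following is a minimal Python sketch of a brute-force edit similarity join: report every pair of strings whose edit distance is at most a threshold. It is not the algorithm proposed in the paper; it only illustrates the problem definition and the kind of per-pair verification the abstract mentions. The function names and the threshold parameter `k` are assumptions made for this illustration.

```python
from itertools import combinations


def within_edit_distance(a: str, b: str, k: int) -> bool:
    """Return True if the edit distance between a and b is at most k."""
    # A length gap larger than k already needs more than k insertions/deletions.
    if abs(len(a) - len(b)) > k:
        return False
    inf = k + 1  # any value above k means "too far"; exact value is irrelevant
    # Row 0 of the dynamic program: distance from the empty prefix of a.
    prev = [j if j <= k else inf for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        cur = [inf] * (len(b) + 1)
        if i <= k:
            cur[0] = i
        # Only cells with |i - j| <= k (a band of width 2k + 1) can hold a value <= k.
        lo, hi = max(1, i - k), min(len(b), i + k)
        for j in range(lo, hi + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            cur[j] = min(prev[j - 1] + cost, prev[j] + 1, cur[j - 1] + 1, inf)
        if min(cur) > k:  # the whole row exceeds k, so the final cell will too
            return False
        prev = cur
    return prev[len(b)] <= k


def naive_edit_join(strings, k):
    """Return index pairs (i, j), i < j, whose strings are within edit distance k."""
    return [
        (i, j)
        for (i, s), (j, t) in combinations(enumerate(strings), 2)
        if within_edit_distance(s, t, k)
    ]


words = ["kitten", "sitting", "mitten", "banana"]
print(naive_edit_join(words, 1))
print(naive_edit_join(words, 3))
```

This baseline verifies all pairs, so its cost grows quadratically with the number of strings; filtering approaches such as the partition-based one described in the abstract aim to verify only a small set of candidate pairs.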
Author Zhang, Haoyu
Zhang, Qin
Karpov, Nikolai
Author_xml – sequence: 1
  givenname: Nikolai
  surname: Karpov
  fullname: Karpov, Nikolai
  organization: Indiana University
– sequence: 2
  givenname: Haoyu
  surname: Zhang
  fullname: Zhang, Haoyu
  organization: Meta Inc
– sequence: 3
  givenname: Qin
  orcidid: 0000-0002-6851-3115
  surname: Zhang
  fullname: Zhang, Qin
  email: qzhangcs@indiana.edu
  organization: Indiana University
CitedBy_id crossref_primary_10_14778_3749646_3749660
Cites_doi 10.1145/2627692.2627706
10.1093/bioinformatics/btaa252
10.1109/AXMEDIS.2006.40
10.1186/gb-2013-14-6-405
10.1109/TKDE.2012.79
10.1145/2213836.2213847
10.1007/978-3-662-44753-6_5
10.1145/1989323.1989431
10.1109/ICICIC.2008.422
10.1145/1242572.1242591
10.1109/ICDE.2008.4497434
10.1145/3292500.3330853
10.1016/S0019-9958(85)80046-2
10.1145/3097983.3098003
10.1145/275487.275495
ContentType Journal Article
Copyright The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
The Author(s), under exclusive licence to Springer-Verlag GmbH Germany, part of Springer Nature 2023.
DOI 10.1007/s00778-023-00806-z
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISSN 0949-877X
EndPage 299
ExternalDocumentID 10_1007_s00778_023_00806_z
GrantInformation_xml – fundername: National Science Foundation
  grantid: CCF-1844234
  funderid: http://dx.doi.org/10.13039/100000001
IsPeerReviewed false
IsScholarly true
Issue 2
Keywords String similarity joins
Edit distance
Local hash minima
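The keyword "Local hash minima" suggests cutting strings at positions whose hash value is a local minimum, so that similar strings tend to share identical pieces. The record does not describe the paper's actual partitioning scheme, so the following Python sketch is only a loose illustration of that general idea under stated assumptions: the q-gram length `q`, the window radius `r`, the use of CRC32 as the hash, and all function names are hypothetical choices, and the sketch carries no accuracy guarantee.

```python
import zlib
from collections import defaultdict


def local_minima_partition(s: str, q: int = 3, r: int = 2):
    """Split s at positions whose q-gram hash is a strict minimum within radius r."""
    n = len(s) - q + 1  # number of q-grams
    if n <= 0:
        return [s]
    # CRC32 is used because it is deterministic across runs (unlike built-in hash()).
    h = [zlib.crc32(s[i:i + q].encode("utf-8")) for i in range(n)]
    anchors = []
    for i in range(n):
        lo, hi = max(0, i - r), min(n, i + r + 1)
        if all(h[i] < h[j] for j in range(lo, hi) if j != i):
            anchors.append(i)
    # Cut the string at every anchor; position 0 never produces an empty piece.
    cuts = [0] + [a for a in anchors if a > 0] + [len(s)]
    return [s[cuts[t]:cuts[t + 1]] for t in range(len(cuts) - 1)]


def candidate_pairs(strings, q: int = 3, r: int = 2):
    """Return index pairs (i, j), i < j, that share at least one identical piece."""
    buckets = defaultdict(set)
    for idx, s in enumerate(strings):
        for part in local_minima_partition(s, q, r):
            buckets[part].add(idx)
    pairs = set()
    for ids in buckets.values():
        ordered = sorted(ids)
        pairs.update((a, b) for x, a in enumerate(ordered) for b in ordered[x + 1:])
    return pairs


reads = ["ACGTACGTTGCA", "ACGTACGTTGCA", "TTTT"]
print(local_minima_partition(reads[0]))
print(candidate_pairs(reads))
```

Because each cut depends only on the hashes in a small window, an edit far from a piece leaves that piece unchanged, which is why shared pieces can serve as a cheap filter. Candidate pairs produced this way would still need an edit distance check, and a simple filter like this one can miss similar pairs.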
Language English
ORCID 0000-0002-6851-3115
PageCount 19
PublicationCentury 2000
PublicationDate 20240300
2024-03-00
20240301
PublicationDateYYYYMMDD 2024-03-01
PublicationDecade 2020
PublicationPlace Berlin/Heidelberg
PublicationPlace_xml – name: Berlin/Heidelberg
– name: New York
PublicationSubtitle The International Journal on Very Large Data Bases
PublicationTitle The VLDB journal
PublicationTitleAbbrev The VLDB Journal
PublicationYear 2024
Publisher Springer Berlin Heidelberg
Springer Nature B.V
StartPage 281
SubjectTerms Accuracy
Algorithms
Bioinformatics
Computer Science
Data mining
Database Management
Datasets
Genomes
Regular Paper
Run time (computers)
Similarity
Strings
Title MinJoin++: a fast algorithm for string similarity joins under edit distance
URI https://link.springer.com/article/10.1007/s00778-023-00806-z
https://www.proquest.com/docview/3256782916
Volume 33