ProbMinHash - A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity

Bibliographic details
Published in: IEEE Transactions on Knowledge and Data Engineering, Vol. 34, No. 7, pp. 3491-3506
Main author: Ertl, Otmar
Format: Journal Article
Language: English
Publication details: New York, IEEE, 2022-07-01
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1041-4347, 1558-2191
Abstract: The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. In combination with a hash algorithm that maps those weighted sets to compact signatures which allow fast estimation of pairwise similarities, it constitutes a valuable method for big data applications such as near-duplicate detection, nearest neighbor search, or clustering. This paper introduces a class of one-pass locality-sensitive hash algorithms that are orders of magnitude faster than the original approach. The performance gain is achieved by calculating signature components not independently, but collectively. Four different algorithms are proposed based on this idea. Two of them are statistically equivalent to the original approach and can be used as drop-in replacements. The other two may even improve the estimation error by introducing statistical dependence between signature components. Moreover, the presented techniques can be specialized for the conventional Jaccard similarity, resulting in highly efficient algorithms that outperform traditional minwise hashing and that are able to compete with the state of the art.
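As a rough illustration of the quantities the abstract refers to, the following Python sketch computes the probability Jaccard similarity exactly and estimates it from signatures. It is not one of the four algorithms proposed in the paper. It assumes the naive scheme in which every signature component is computed independently, which is only our reading of the slower "original approach" mentioned in the abstract. The function names (`prob_jaccard`, `signature`, `estimate`) and the SHA-256-based helper `_unit_uniform` are hypothetical and chosen for this example only.

```python
import hashlib
import math


def prob_jaccard(x, y):
    """Exact probability Jaccard similarity of two weighted sets.

    x and y map elements to non-negative weights; the result does not
    change if either input is rescaled by a positive constant.
    """
    keys = set(x) | set(y)
    common = [e for e in x if x[e] > 0 and y.get(e, 0.0) > 0]
    total = 0.0
    for i in common:
        denom = sum(max(x.get(j, 0.0) / x[i], y.get(j, 0.0) / y[i]) for j in keys)
        total += 1.0 / denom
    return total


def _unit_uniform(element, k):
    """Deterministic pseudo-random number in (0, 1] derived from (element, k).

    Hypothetical helper: any well-mixing hash would serve the same purpose.
    """
    digest = hashlib.sha256(f"{k}:{element}".encode()).digest()
    return (int.from_bytes(digest[:8], "big") + 0.5) / 2.0**64


def signature(weights, m):
    """Naive signature with m independently computed components.

    Component k is the element that minimizes an exponentially distributed
    value with rate equal to the element's weight. Cost is O(m * n).
    """
    candidates = [e for e, w in weights.items() if w > 0]
    return [
        min(candidates, key=lambda e: -math.log(_unit_uniform(e, k)) / weights[e])
        for k in range(m)
    ]


def estimate(sig_a, sig_b):
    """Fraction of equal components, used as the similarity estimate."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

With `x = {"a": 1.0, "b": 1.0}` and `y = {"a": 1.0}`, `prob_jaccard(x, y)` is 0.5, and the fraction of matching signature components should approach that value as `m` grows. Because each of the `m` components scans all elements, this sketch illustrates the per-component cost that collective computation of the components is meant to avoid.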
Author Ertl, Otmar (Dynatrace Research, Linz, Austria)
ORCID 0000-0001-7322-6332
Email otmar.ertl@dynatrace.com
CODEN ITKEEH
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022
DOI 10.1109/TKDE.2020.3021176
Discipline Engineering
Computer Science
Genre orig-research
ISICitedReferencesCount 22
IsPeerReviewed true
IsScholarly true
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
Subject terms Algorithms; Clustering; Clustering algorithms; Device-to-device communication; Estimation error; Indexes; Jaccard similarity; Locality-sensitive hashing; Measurement; Minwise hashing; Probabilistic data structures; Similarity; Statistical analysis; Streaming algorithms; Time complexity
URI https://ieeexplore.ieee.org/document/9185081
https://www.proquest.com/docview/2672805849