ProbMinHash - A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity
The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. In combination with a hash algorithm that maps those weighted sets to compact sign...
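As a rough illustration of what such signatures estimate, the following sketch computes the probability Jaccard similarity exactly and then approximates it with a baseline signature whose components are computed independently, one exponential variate per element and component. It is meant in the spirit of the slow independent-component scheme, not as one of the faster ProbMinHash algorithms the paper proposes. The function names, the SHA-256-based uniform hash, and the example weights are all assumptions made for illustration.

```python
import hashlib
import math


def prob_jaccard(w, v):
    """Exact probability Jaccard similarity of two weighted sets.

    w and v map elements to positive weights; the result is invariant
    to rescaling either input.
    """
    keys = set(w) | set(v)
    total = 0.0
    for d in keys:
        wd, vd = w.get(d, 0.0), v.get(d, 0.0)
        if wd <= 0.0 or vd <= 0.0:
            continue  # only elements present in both sets contribute
        denom = sum(max(w.get(e, 0.0) / wd, v.get(e, 0.0) / vd) for e in keys)
        total += 1.0 / denom
    return total


def _uniform(element, k):
    """Hypothetical helper: deterministic pseudo-uniform value in (0, 1)."""
    digest = hashlib.sha256(f"{k}:{element}".encode()).digest()
    x = int.from_bytes(digest[:8], "big") >> 11  # keep 53 bits
    return (x + 1) / ((1 << 53) + 1)


def signature(w, m):
    """Baseline signature: component k is the element minimizing an
    exponential variate with rate equal to the element's weight."""
    sig = []
    for k in range(m):
        best, best_val = None, math.inf
        for d, weight in w.items():
            if weight <= 0.0:
                continue
            val = -math.log(_uniform(d, k)) / weight
            if val < best_val:
                best, best_val = d, val
        sig.append(best)
    return sig


def estimate(sig_a, sig_b):
    """Fraction of matching signature components."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Illustrative weighted sets (made-up values).
A = {"a": 3.0, "b": 1.0, "c": 1.0}
B = {"a": 1.0, "b": 1.0, "d": 2.0}

exact = prob_jaccard(A, B)
approx = estimate(signature(A, 2000), signature(B, 2000))
print(f"exact={exact:.4f} estimated={approx:.4f}")
```

Each signature component here costs one pass over all elements, so the total work grows with the number of elements times the signature size. That cost is what computing the components collectively, as the abstract describes, is intended to avoid.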
| Published in: | IEEE Transactions on Knowledge and Data Engineering, Vol. 34, No. 7, pp. 3491-3506 |
|---|---|
| Main author: | Ertl, Otmar |
| Format: | Journal Article |
| Language: | English |
| Publication details: | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 1 July 2022 |
| Subjects: | |
| ISSN: | 1041-4347, 1558-2191 |
| Abstract | The probability Jaccard similarity was recently proposed as a natural generalization of the Jaccard similarity to measure the proximity of sets whose elements are associated with relative frequencies or probabilities. In combination with a hash algorithm that maps those weighted sets to compact signatures which allow fast estimation of pairwise similarities, it constitutes a valuable method for big data applications such as near-duplicate detection, nearest neighbor search, or clustering. This paper introduces a class of one-pass locality-sensitive hash algorithms that are orders of magnitude faster than the original approach. The performance gain is achieved by calculating signature components not independently, but collectively. Four different algorithms are proposed based on this idea. Two of them are statistically equivalent to the original approach and can be used as drop-in replacements. The other two may even improve the estimation error by introducing statistical dependence between signature components. Moreover, the presented techniques can be specialized for the conventional Jaccard similarity, resulting in highly efficient algorithms that outperform traditional minwise hashing and that are able to compete with the state of the art. |
|---|---|
| Author | Ertl, Otmar |
| Author_xml | Ertl, Otmar (ORCID: 0000-0001-7322-6332; email: otmar.ertl@dynatrace.com; organization: Dynatrace Research, Linz, Austria) |
| CODEN | ITKEEH |
| CitedBy_id | crossref_primary_10_1109_TETC_2021_3109559 crossref_primary_10_1007_s11042_024_18258_0 crossref_primary_10_1016_j_neucom_2022_07_013 crossref_primary_10_1109_TKDE_2023_3297195 crossref_primary_10_1109_TNANO_2021_3139394 crossref_primary_10_1109_TDSC_2021_3101280 crossref_primary_10_1109_TNSE_2023_3275809 crossref_primary_10_1093_nar_gkae609 crossref_primary_10_1109_TVCG_2024_3456383 |
| Cites_doi | 10.1145/276698.276876 10.18637/jss.v005.i08 10.1109/ICDM.2017.64 10.1017/CBO9781139924801 10.1109/ICDM.2018.00050 10.1109/SEQUEN.1997.666900 10.1145/2783258.2783406 10.1109/TKDE.2018.2867468 10.1145/3055399.3055443 10.1145/3128572.3140446 10.1145/509907.509965 10.1145/3292500.3330951 10.1145/1060745.1060840 10.1109/ICDM.2010.80 10.1145/1772690.1772759 10.1109/FOCS.2017.67 10.1145/3269206.3271690 10.1093/bioinformatics/btz354 10.1145/3219819.3220089 10.1007/978-1-4613-8643-8 10.1186/s40168-019-0653-2 10.1145/3230636 10.1145/3097983.3098111 10.25080/Majora-7ddc1dd1-00e 10.1016/j.cose.2019.101590 |
| ContentType | Journal Article |
| Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2022 |
| DOI | 10.1109/TKDE.2020.3021176 |
| DatabaseName | IEEE Xplore (IEEE); IEEE All-Society Periodicals Package (ASPP) 1998–Present; IEEE Electronic Library (IEL); CrossRef; Computer and Information Systems Abstracts; Electronics & Communications Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts Academic; Computer and Information Systems Abstracts Professional |
| Discipline | Engineering Computer Science |
| EISSN | 1558-2191 |
| EndPage | 3506 |
| ExternalDocumentID | 10_1109_TKDE_2020_3021176 9185081 |
| Genre | orig-research |
| ISICitedReferencesCount | 22 |
| ISSN | 1041-4347 |
| IsPeerReviewed | true |
| IsScholarly | true |
| Issue | 7 |
| Language | English |
| License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html |
| ORCID | 0000-0001-7322-6332 |
| PageCount | 16 |
| PublicationCentury | 2000 |
| PublicationDate | 2022-07-01 |
| PublicationDecade | 2020 |
| PublicationPlace | New York |
| PublicationTitle | IEEE transactions on knowledge and data engineering |
| PublicationTitleAbbrev | TKDE |
| PublicationYear | 2022 |
| Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
| StartPage | 3491 |
| SubjectTerms | Algorithms; Clustering; Clustering algorithms; Device-to-device communication; Estimation error; Indexes; Jaccard similarity; locality-sensitive hashing; Measurement; Minwise hashing; probabilistic data structures; Similarity; Statistical analysis; streaming algorithms; Time complexity |
| Title | ProbMinHash - A Class of Locality-Sensitive Hash Algorithms for the (Probability) Jaccard Similarity |
| URI | https://ieeexplore.ieee.org/document/9185081 https://www.proquest.com/docview/2672805849 |
| Volume | 34 |