Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons
The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the...
| Published in: | Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 1122-1132 |
|---|---|
| Main authors: | Besta, Maciej; Kanakagiri, Raghavendra; Mustafa, Harun; Karasikov, Mikhail; Ratsch, Gunnar; Hoefler, Torsten; Solomonik, Edgar |
| Format: | Conference paper |
| Language: | English |
| Publication details: | IEEE, 01.05.2020 |
| ISSN: | 1530-2075 |
| Abstract | The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and the sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains. |
|---|---|
| Author | Ratsch, Gunnar; Solomonik, Edgar; Besta, Maciej; Hoefler, Torsten; Kanakagiri, Raghavendra; Mustafa, Harun; Karasikov, Mikhail |
| Author_xml | 1. Besta, Maciej (ETH Zurich, Department of Computer Science); 2. Kanakagiri, Raghavendra (Indian Institute of Technology Tirupati, Department of Computer Science); 3. Mustafa, Harun (ETH Zurich, Department of Computer Science); 4. Karasikov, Mikhail (ETH Zurich, Department of Computer Science); 5. Ratsch, Gunnar (ETH Zurich, Department of Computer Science); 6. Hoefler, Torsten (ETH Zurich, Department of Computer Science); 7. Solomonik, Edgar (University of Illinois at Urbana-Champaign, Department of Computer Science) |
| ContentType | Conference Proceeding |
| DOI | 10.1109/IPDPS47924.2020.00118 |
| Discipline | Computer Science |
| EISBN | 9781728168760 1728168767 |
| EISSN | 1530-2075 |
| EndPage | 1132 |
| ExternalDocumentID | 9139876 |
| Genre | orig-research |
| ISICitedReferencesCount | 43 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| PageCount | 11 |
| PublicationCentury | 2000 |
| PublicationDate | 2020-05-01 |
| PublicationTitle | Proceedings - IEEE International Parallel and Distributed Processing Symposium |
| PublicationTitleAbbrev | IPDPS |
| PublicationYear | 2020 |
| Publisher | IEEE |
| StartPage | 1122 |
| SubjectTerms | Bioinformatics; Cyclops Tensor Framework; Distributed Jaccard Distance; Distributed Jaccard similarity; Genome Sequence Distance; Genomics; High-Performance Genome Processing; Indexes; k-Mers; Matrix-Matrix Multiplication; Metagenome Sequence Distance; Sequential analysis; Sparse matrices |
| Title | Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons |
| URI | https://ieeexplore.ieee.org/document/9139876 |
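
The abstract states that the Jaccard computation is encoded as a multiplication of sparse matrices. The following is a minimal single-process sketch of that general idea, not the distributed SimilarityAtScale implementation: it does nothing to reduce communication or synchronization, and the function names `kmers` and `jaccard_all_pairs` are hypothetical, chosen only for illustration. It assumes each sample is a set of elements (for genomes, k-mers), treats the samples as columns of a sparse 0/1 indicator matrix A, accumulates the intersection sizes B = AᵀA, and derives each similarity as B[i][j] / (|X_i| + |X_j| - B[i][j]).

```python
from collections import defaultdict
from itertools import combinations


def kmers(sequence, k):
    """Return the set of all length-k substrings (k-mers) of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def jaccard_all_pairs(samples):
    """Compute the Jaccard similarity for every pair of sets in `samples`.

    Conceptually, A is a sparse 0/1 indicator matrix with one row per
    element and one column per sample. B = A^T A holds intersection
    sizes, and J[i][j] = B[i][j] / (|X_i| + |X_j| - B[i][j]).
    """
    n = len(samples)

    # Sparse rows of A: element -> ascending list of sample indices.
    rows = defaultdict(list)
    for j, sample in enumerate(samples):
        for element in sample:
            rows[element].append(j)

    # Accumulate the upper triangle of B = A^T A, one row of A at a time.
    inter = defaultdict(int)
    for cols in rows.values():
        for i, j in combinations(cols, 2):
            inter[(i, j)] += 1

    sizes = [len(sample) for sample in samples]
    sim = [[1.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            b = inter.get((i, j), 0)
            union = sizes[i] + sizes[j] - b
            # Convention assumed here: two empty sets have similarity 1.0.
            value = b / union if union else 1.0
            sim[i][j] = sim[j][i] = value
    return sim
```

For example, `jaccard_all_pairs([{1, 2, 3}, {2, 3, 4}, {5}])` gives 0.5 for the first pair (two shared elements out of four in the union) and 0.0 for any pair involving the third set. Processing A row by row means only elements that actually occur contribute work, which is the property that makes a sparse formulation attractive for large k-mer sets.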