Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons

Bibliographic details
Published in: Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 1122-1132
Main authors: Besta, Maciej; Kanakagiri, Raghavendra; Mustafa, Harun; Karasikov, Mikhail; Ratsch, Gunnar; Hoefler, Torsten; Solomonik, Edgar
Format: Conference paper
Language: English
Published: IEEE, 2020-05-01
ISSN: 1530-2075
Abstract The Jaccard similarity index is an important measure of the overlap of two sets, widely used in machine learning, computational genomics, information retrieval, and many other areas. We design and implement SimilarityAtScale, the first communication-efficient distributed algorithm for computing the Jaccard similarity among pairs of large datasets. Our algorithm provides an efficient encoding of this problem into a multiplication of sparse matrices. Both the encoding and sparse matrix product are performed in a way that minimizes data movement in terms of communication and synchronization costs. We apply our algorithm to obtain similarity among all pairs of a set of large samples of genomes. This task is a key part of modern metagenomics analysis and an ever-growing need due to the increasing availability of high-throughput DNA sequencing data. The resulting scheme is the first to enable accurate Jaccard distance derivations for massive datasets, using large-scale distributed-memory systems. We package our routines in a tool, called GenomeAtScale, that combines the proposed algorithm with tools for processing input sequences. Our evaluation on real data illustrates that one can use GenomeAtScale to effectively employ tens of thousands of processors to reach new frontiers in large-scale genomic and metagenomic analysis. While GenomeAtScale can be used to foster DNA research, the more general underlying SimilarityAtScale algorithm may be used for high-performance distributed similarity computations in other data analytics application domains.
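The abstract's central idea, encoding all-pairs Jaccard similarity as a sparse matrix product, can be illustrated with a small single-process sketch. This is not the authors' SimilarityAtScale implementation, which is distributed and communication-optimized; it only shows the algebra under simple assumptions. If A is a 0/1 indicator matrix with one row per distinct element and one column per sample, then B = AᵀA holds the pairwise intersection sizes, and J(i, j) = B[i][j] / (|X_i| + |X_j| - B[i][j]). The function names `kmers` and `jaccard_all_pairs` are hypothetical, and returning 1.0 for two empty sets is a convention chosen here, not something the record specifies.

```python
from collections import defaultdict
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings (k-mers) of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard_all_pairs(samples):
    """Return {(i, j): Jaccard similarity} for all i < j over a list of sets.

    Illustrative only: a sequential stand-in for the sparse product B = A^T A,
    where A[x][j] = 1 if element x occurs in sample j.
    """
    n = len(samples)

    # Sparse rows of A: element -> ascending list of sample indices holding it.
    rows = defaultdict(list)
    for j, sample in enumerate(samples):
        for x in sample:
            rows[x].append(j)

    # Accumulate the upper triangle of B = A^T A one sparse row at a time:
    # each element shared by samples i and j adds 1 to B[i][j].
    inter = defaultdict(int)
    for cols in rows.values():
        for i, j in combinations(cols, 2):
            inter[(i, j)] += 1

    sizes = [len(s) for s in samples]
    sim = {}
    for i in range(n):
        for j in range(i + 1, n):
            b = inter.get((i, j), 0)
            union = sizes[i] + sizes[j] - b
            # Assumed convention: two empty sets are treated as identical.
            sim[(i, j)] = b / union if union else 1.0
    return sim
```

For example, calling `jaccard_all_pairs([kmers("ACGTAC", 3), kmers("CGTACC", 3), kmers("TTTTTT", 3)])` compares three toy sequences by their 3-mer sets. Only rows of A with at least two nonzeros contribute work, which is why sparsity matters for large genome collections.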
Author_xml – sequence: 1
  givenname: Maciej
  surname: Besta
  fullname: Besta, Maciej
  organization: ETH Zurich, Department of Computer Science
– sequence: 2
  givenname: Raghavendra
  surname: Kanakagiri
  fullname: Kanakagiri, Raghavendra
  organization: Indian Institute of Technology Tirupati, Department of Computer Science
– sequence: 3
  givenname: Harun
  surname: Mustafa
  fullname: Mustafa, Harun
  organization: ETH Zurich, Department of Computer Science
– sequence: 4
  givenname: Mikhail
  surname: Karasikov
  fullname: Karasikov, Mikhail
  organization: ETH Zurich, Department of Computer Science
– sequence: 5
  givenname: Gunnar
  surname: Ratsch
  fullname: Ratsch, Gunnar
  organization: ETH Zurich, Department of Computer Science
– sequence: 6
  givenname: Torsten
  surname: Hoefler
  fullname: Hoefler, Torsten
  organization: ETH Zurich, Department of Computer Science
– sequence: 7
  givenname: Edgar
  surname: Solomonik
  fullname: Solomonik, Edgar
  organization: University of Illinois at Urbana-Champaign, Department of Computer Science
ContentType Conference Proceeding
DOI 10.1109/IPDPS47924.2020.00118
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
Discipline Computer Science
EISBN 9781728168760
1728168767
EISSN 1530-2075
EndPage 1132
ExternalDocumentID 9139876
Genre orig-research
ISICitedReferencesCount 43
IsPeerReviewed false
IsScholarly false
Language English
PageCount 11
PublicationCentury 2000
PublicationDate 2020-05-01
PublicationDecade 2020
PublicationTitle Proceedings - IEEE International Parallel and Distributed Processing Symposium
PublicationTitleAbbrev IPDPS
PublicationYear 2020
Publisher IEEE
StartPage 1122
SubjectTerms Bioinformatics
Cyclops Tensor Framework
Distributed Jaccard Distance
Distributed Jaccard similarity
Genome Sequence Distance
Genomics
High-Performance Genome Processing
Indexes
k-Mers
Matrix-Matrix Multiplication
Metagenome Sequence Distance
Sequential analysis
Sparse matrices
Title Communication-Efficient Jaccard similarity for High-Performance Distributed Genome Comparisons
URI https://ieeexplore.ieee.org/document/9139876