Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting


Detailed bibliography
Published in: SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 135-147
Main authors: Pan, Tony C.; Misra, Sanchit; Aluru, Srinivas
Format: Conference paper
Language: English
Publication details: IEEE, 1 November 2018
Abstract High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation used in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed length substrings of DNA sequences called k-mers. Counting k-mers is often accomplished via hashing, and distributed memory k-mer counting algorithms for large datasets are memory access and network communication bound. In this work, we present two optimized distributed parallel hash table techniques that utilize cache friendly algorithms for local hashing, overlapped communication and computation to hide communication costs, and vectorized hash functions that are specialized for k-mer and other short key indices. On 4096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06× and 3.7×, respectively, over the previous state-of-the-art distributed memory k-mer counter.
Author Aluru, Srinivas
Pan, Tony C.
Misra, Sanchit
Author_xml – sequence: 1
  givenname: Tony C.
  surname: Pan
  fullname: Pan, Tony C.
– sequence: 2
  givenname: Sanchit
  surname: Misra
  fullname: Misra, Sanchit
– sequence: 3
  givenname: Srinivas
  surname: Aluru
  fullname: Aluru, Srinivas
CODEN IEEPAD
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/SC.2018.00014
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Xplore POP ALL
IEEE Xplore All Conference Proceedings
IEEE Electronic Library (IEL)
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 1538683849
9781538683842
EndPage 147
ExternalDocumentID 8665746
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-a158t-39fb01ee8a6d2ae23b0768ee6535063decbf97d3c2c1001b2aad8850bdcf22673
IEDL.DBID RIE
IngestDate Thu Jun 29 18:39:01 EDT 2023
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-a158t-39fb01ee8a6d2ae23b0768ee6535063decbf97d3c2c1001b2aad8850bdcf22673
PageCount 13
ParticipantIDs ieee_primary_8665746
PublicationCentury 2000
PublicationDate 2018-Nov
PublicationDateYYYYMMDD 2018-11-01
PublicationDate_xml – month: 11
  year: 2018
  text: 2018-Nov
PublicationDecade 2010
PublicationTitle SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev SC
PublicationYear 2018
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 135
SubjectTerms Bandwidth
Bioinformatics
cache-aware optimizations
distributed memory algorithms
Genomics
Hash functions
Hash tables
k-mer counting
Memory management
Prefetching
Sequential analysis
vectorization
Title Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting
URI https://ieeexplore.ieee.org/document/8665746