Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting

High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation used in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed length substrings of DNA sequences called k-mers. Counting k-mers is often accomplis...
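
As a rough illustration of the hash-based k-mer counting described above, the following minimal sketch counts k-mers into several local tables, one per simulated rank. It is not the authors' implementation: the 2-bit encoding, the modulo-based owner assignment (a stand-in for a real hash function), and the names `encode_kmers`, `count_kmers_partitioned`, and `decode_kmer` are all assumptions made for this example, and no actual network communication or vectorization is shown.

```python
from collections import Counter

# 2-bit encoding of the DNA alphabet (an illustrative choice, not taken from the paper).
ENCODE = {"A": 0, "C": 1, "G": 2, "T": 3}


def encode_kmers(seq, k):
    """Yield each length-k substring of seq as a 2k-bit integer.

    Windows containing a character outside A/C/G/T are skipped.
    """
    mask = (1 << (2 * k)) - 1
    value = 0
    valid = 0  # number of consecutive valid bases ending at the current position
    for base in seq.upper():
        code = ENCODE.get(base)
        if code is None:
            value, valid = 0, 0
            continue
        value = ((value << 2) | code) & mask
        valid += 1
        if valid >= k:
            yield value


def count_kmers_partitioned(reads, k, num_ranks):
    """Count k-mers into num_ranks local tables, one per simulated rank."""
    tables = [Counter() for _ in range(num_ranks)]
    for read in reads:
        for kmer in encode_kmers(read, k):
            owner = kmer % num_ranks  # hypothetical placeholder for a real hash function
            tables[owner][kmer] += 1
    return tables


def decode_kmer(value, k):
    """Turn a 2k-bit integer back into its DNA string."""
    return "".join("ACGT"[(value >> (2 * (k - 1 - i))) & 3] for i in range(k))
```

In a real distributed memory setting each table would live on a different process and k-mers would be sent to their owner over the network; here the per-rank tables simply sit in one list so the partitioning logic can be inspected in isolation.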

Bibliographic Details
Published in:	SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 135-147
Main Authors:	Pan, Tony C.; Misra, Sanchit; Aluru, Srinivas
Format:	Conference Paper
Language:	English
Published:	IEEE, 1 November 2018
Subjects:	Bandwidth; Bioinformatics; cache-aware optimizations; distributed memory algorithms; Genomics; Hash functions; Hash tables; k-mer counting; Memory management; Prefetching; Sequential analysis; vectorization

Abstract	High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation used in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed length substrings of DNA sequences called k-mers. Counting k-mers is often accomplished via hashing, and distributed memory k-mer counting algorithms for large datasets are memory access and network communication bound. In this work, we present two optimized distributed parallel hash table techniques that utilize cache friendly algorithms for local hashing, overlapped communication and computation to hide communication costs, and vectorized hash functions that are specialized for k-mer and other short key indices. On 4096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06× and 3.7×, respectively, over the previous state-of-the-art distributed memory k-mer counter.
Author	Aluru, Srinivas; Pan, Tony C.; Misra, Sanchit
CODEN	IEEPAD
ContentType	Conference Proceeding
DOI	10.1109/SC.2018.00014
EISBN	1538683849 9781538683842
EndPage	147
ExternalDocumentID	8665746
Genre	orig-research
IsPeerReviewed	false
IsScholarly	false
Language	English
PageCount	13
PublicationCentury	2000
PublicationDate	2018-Nov
PublicationDateYYYYMMDD	2018-11-01
PublicationDecade	2010
PublicationTitle	SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev	SC
PublicationYear	2018
Publisher	IEEE
StartPage	135
Title	Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting
URI	https://ieeexplore.ieee.org/document/8665746