Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting
High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation used in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed length substrings of DNA sequences called k-mers. Counting k-mers is often accomplis...
| Published in: | SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 135-147 |
|---|---|
| Main authors: | Pan, Tony C.; Misra, Sanchit; Aluru, Srinivas |
| Format: | Conference paper |
| Language: | English |
| Published: | IEEE, 2018-11-01 |
| Subjects: | |
| Online access: | Get full text |
| Abstract | High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation used in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed length substrings of DNA sequences called k-mers. Counting k-mers is often accomplished via hashing, and distributed memory k-mer counting algorithms for large datasets are memory access and network communication bound. In this work, we present two optimized distributed parallel hash table techniques that utilize cache-friendly algorithms for local hashing, overlapped communication and computation to hide communication costs, and vectorized hash functions that are specialized for k-mer and other short key indices. On 4096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06× and 3.7×, respectively, over the previous state-of-the-art distributed memory k-mer counter. |
|---|---|
| Author | Aluru, Srinivas; Pan, Tony C.; Misra, Sanchit |
| Author_xml | – sequence: 1 givenname: Tony C. surname: Pan fullname: Pan, Tony C. – sequence: 2 givenname: Sanchit surname: Misra fullname: Misra, Sanchit – sequence: 3 givenname: Srinivas surname: Aluru fullname: Aluru, Srinivas |
| CODEN | IEEPAD |
| ContentType | Conference Proceeding |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1109/SC.2018.00014 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| EISBN | 1538683849; 9781538683842 |
| EndPage | 147 |
| ExternalDocumentID | 8665746 |
| Genre | orig-research |
| GroupedDBID | 6IE 6IL CBEJK RIE RIL |
| ID | FETCH-LOGICAL-a158t-39fb01ee8a6d2ae23b0768ee6535063decbf97d3c2c1001b2aad8850bdcf22673 |
| IEDL.DBID | RIE |
| IngestDate | Thu Jun 29 18:39:01 EDT 2023 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Language | English |
| LinkModel | DirectLink |
| MergedId | FETCHMERGED-LOGICAL-a158t-39fb01ee8a6d2ae23b0768ee6535063decbf97d3c2c1001b2aad8850bdcf22673 |
| PageCount | 13 |
| ParticipantIDs | ieee_primary_8665746 |
| PublicationCentury | 2000 |
| PublicationDate | 2018-Nov |
| PublicationDateYYYYMMDD | 2018-11-01 |
| PublicationDate_xml | – month: 11 year: 2018 text: 2018-Nov |
| PublicationDecade | 2010 |
| PublicationTitle | SC18: International Conference for High Performance Computing, Networking, Storage and Analysis |
| PublicationTitleAbbrev | SC |
| PublicationYear | 2018 |
| Publisher | IEEE |
| Publisher_xml | – name: IEEE |
| Score | 1.7982374 |
| SourceID | ieee |
| SourceType | Publisher |
| StartPage | 135 |
| SubjectTerms | Bandwidth; Bioinformatics; cache-aware optimizations; distributed memory algorithms; Genomics; Hash functions; Hash tables; k-mer counting; Memory management; Prefetching; Sequential analysis; vectorization |
| Title | Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting |
| URI | https://ieeexplore.ieee.org/document/8665746 |
| hasFullText | 1 |
| inHoldings | 1 |
| isFullTextHit | |
| isPrint | |
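
The abstract describes counting k-mers by hashing, with the hash table distributed across processes. The Python sketch below illustrates only that general idea and is not the authors' implementation: it runs serially, simulates the ranks as a list of local tables, and uses `zlib.crc32` as a stand-in for the paper's vectorized hash functions. The names `kmers`, `owner`, `count_kmers_partitioned`, and `query` are hypothetical, and the sketch makes no attempt at the cache-friendly layout or the overlap of communication and computation that the paper reports.

```python
from collections import Counter
import zlib


def kmers(seq, k):
    """Yield every length-k substring of seq made only of A, C, G, T."""
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):  # skip windows with N or other symbols
            yield kmer


def owner(kmer, num_ranks):
    """Pick the rank that owns a k-mer (assumed rule: CRC32 mod rank count)."""
    return zlib.crc32(kmer.encode("ascii")) % num_ranks


def count_kmers_partitioned(reads, k, num_ranks):
    """Count k-mers into one local hash table per simulated rank."""
    tables = [Counter() for _ in range(num_ranks)]
    for read in reads:
        for kmer in kmers(read, k):
            # In a real distributed run this would be a message to the owner.
            tables[owner(kmer, num_ranks)][kmer] += 1
    return tables


def query(tables, kmer):
    """Look up a k-mer's count in the table of the rank that owns it."""
    return tables[owner(kmer, len(tables))][kmer]


if __name__ == "__main__":
    tables = count_kmers_partitioned(["ACGTACGT", "CGTAN"], k=3, num_ranks=4)
    print(query(tables, "ACG"), query(tables, "CGT"))
```

Because every occurrence of a given k-mer hashes to the same owner, each key lives in exactly one local table, so a query only has to consult a single rank.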