Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting

High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation used in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed length substrings of DNA sequences called k-mers. Counting k-mers is often accomplis...

Full description

Saved in:
Bibliographic Details
Published in:SC18: International Conference for High Performance Computing, Networking, Storage and Analysis pp. 135 - 147
Main Authors: Pan, Tony C., Misra, Sanchit, Aluru, Srinivas
Format: Conference Proceeding
Language:English
Published: IEEE 01.11.2018
Subjects:
Online Access:Get full text
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:High-throughput DNA sequencing is the mainstay of modern genomics research. A common operation used in bioinformatic analysis for many applications of high-throughput sequencing is the counting and indexing of fixed length substrings of DNA sequences called k-mers. Counting k-mers is often accomplished via hashing, and distributed memory k-mer counting algorithms for large datasets are memory access and network communication bound. In this work, we present two optimized distributed parallel hash table techniques that utilize cache friendly algorithms for local hashing, overlapped communication and computation to hide communication costs, and vectorized hash functions that are specialized for fc-mer and other short key indices. On 4096 cores of the NERSC Cori supercomputer, our implementation completed index construction and query on an approximately 1 TB human genome dataset in just 11.8 seconds and 5.8 seconds, demonstrating speedups of 2.06× and 3.7×, respectively, over the previous state-of-the-art distributed memory k-mer counter.
DOI:10.1109/SC.2018.00014