Parallel Computation of the Burrows-Wheeler Transform of Short Reads Using Prefix Parallelism

Bibliographic details
Published in: IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 16, No. 1, pp. 3-13
Main authors: Kimura, Kouichi; Koike, Asako
Format: Journal Article
Language: English
Publication details: United States, IEEE, 01.01.2019
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1545-5963, 1557-9964
Description
Summary: The Burrows-Wheeler transform (BWT) of short-read data has unexplored potential utilities, such as efficient and sensitive variation analysis against multiple reference genome sequences, because it does not depend on any particular reference genome sequence, unlike conventional mapping-based methods. However, since the amount of read data is generally much larger than the size of the reference sequence, computing the BWT of reads is not easy, and this hampers the development of potential applications. To alleviate this problem, a new method of computing the BWT of reads in parallel is proposed. The BWT, corresponding to a sorted list of suffixes of reads, is constructed incrementally by successively including longer and longer suffixes. The working data is divided into more than 10,000 "blocks" corresponding to sublists of suffixes with the same prefixes. Thousands of groups of blocks can be processed in parallel while making exclusive writes and concurrent reads into a shared memory. Reads and writes are basically sequential, and the read concurrency is limited to two. Thus, a fine-grained parallelism, referred to as prefix parallelism, is expected to work efficiently. The time complexity for processing $n$ reads of length $\ell$ is $O(n\ell^2)$. On actual biological DNA sequence data of about 100 Gbp with a read length of 100 bp (base pairs), a tentative implementation of the proposed method took less than an hour on a single-node computer; i.e., it was about three times faster than one of the fastest programs developed so far.
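
The summary describes the BWT of reads as the sorted list of all read suffixes, partitioned into blocks of suffixes that share a prefix. The following is a minimal illustrative sketch of that idea only, not the authors' incremental algorithm and not an efficient implementation. It assumes an end marker `$` appended to each read, ties between equal suffixes broken by read index, and a fixed block prefix length `k`; these conventions and all function names are hypothetical choices made for the example. It shows that sorting each prefix block on its own and concatenating the blocks in prefix order gives the same result as one global sort, which is why the blocks are independent units of work.

```python
from collections import defaultdict

END = "$"  # assumed end marker; sorts before A, C, G, T in ASCII


def suffix_entries(reads):
    """Yield (suffix, read_index, preceding_char) for every suffix of every read."""
    for idx, read in enumerate(reads):
        text = read + END
        for start in range(len(text)):
            # Convention assumed here: the character "before" the whole read
            # is the end marker.
            prev = text[start - 1] if start > 0 else END
            yield text[start:], idx, prev


def bwt_direct(reads):
    """Sort all suffixes at once and read off the preceding characters."""
    entries = sorted(suffix_entries(reads), key=lambda e: (e[0], e[1]))
    return "".join(prev for _, _, prev in entries)


def bwt_by_prefix_blocks(reads, k=2):
    """Group suffixes by their first k characters, then sort each block separately."""
    blocks = defaultdict(list)
    for entry in suffix_entries(reads):
        blocks[entry[0][:k]].append(entry)
    out = []
    # Each block could be sorted by a separate worker; here they run in sequence.
    for key in sorted(blocks):
        block = sorted(blocks[key], key=lambda e: (e[0], e[1]))
        out.extend(prev for _, _, prev in block)
    return "".join(out)


if __name__ == "__main__":
    reads = ["ACGT", "ACCA", "TTGA"]
    print(bwt_direct(reads))
    print(bwt_by_prefix_blocks(reads, k=2))
```

Because every suffix ends in the marker and the marker occurs nowhere else, two suffixes with different length-`k` prefixes are already ordered by those prefixes, so no comparison ever crosses a block boundary.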
DOI:10.1109/TCBB.2018.2837749