A Parallel Algorithm for N-Way Interval Set Intersection

Bibliographic Details
Published in:	Proceedings of the IEEE Vol. 105; no. 3; pp. 542 - 551
Main Authors:	Layer, Ryan M.; Quinlan, Aaron R.
Format:	Journal Article
Language:	English
Published:	United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.03.2017
Subjects:	Algorithm design and analysis; Algorithms; Bioinformatics; Computational biology; Context; Datasets; Gene sequencing; Genome analysis; Genomes; Genomic interval intersection; Genomics; Intersections; Intervals; Multidimensional data; Online analytical processing; Parallel algorithm; Parallel algorithms; Parallel processing; Partitioning algorithms
ISSN:	0018-9219, 1558-2256

Description
Summary:	The comparison of sets of genome intervals (e.g., genes, repeats, ChIP-seq peaks) is essential to genome research, especially as modern sequencing technologies enable ever larger and more complex experiments. Relationships between genomic features are commonly identified by their intersection: that is, if feature sets contain overlapping intervals then it is inferred that they share a common biological function or origin. Using this technique, researchers identify genomic regions that are common among multiple (or unique to individual) data sets. While there have been recent advances in algorithms for pairwise intersections between two sets of genomic intervals, few advances have been made in the intersection of many sets of genomic intervals. Identifying intersections among many interval sets is particularly important when attempting to distill biological insights from the massive, multidimensional data sets that are common to modern genome research. For such analyses, speed and efficiency are crucial, given the size and sheer number of data sets involved. To solve this problem, we present a novel "slice-then-sweep" algorithm that, given N interval sets, efficiently reveals the subset of intervals that are common to all N sets. We demonstrate that our algorithm is more efficient in the sequential case and has a vastly higher capacity for parallelization with a 19x speedup over the existing algorithm.
DOI:	10.1109/JPROC.2015.2461494
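
Illustration:	The Summary describes the N-way problem: given N interval sets, find what is common to all of them. The sketch below is a plain sequential event sweep that makes the problem statement concrete. It is not the authors' "slice-then-sweep" algorithm, whose details are not given in this record. The function name `common_regions`, the half-open `[start, end)` coordinate convention, and the choice to report shared regions rather than the original intervals are all assumptions made for this example.

```python
from typing import List, Tuple

# Half-open [start, end) coordinates: an assumption of this sketch.
Interval = Tuple[int, int]


def common_regions(sets: List[List[Interval]]) -> List[Interval]:
    """Return regions covered by at least one interval from every input set.

    Baseline sweep over sorted endpoints; hypothetical helper, not the
    published slice-then-sweep method.
    """
    n = len(sets)
    if n == 0:
        return []

    # One +1 event per interval start and one -1 event per interval end,
    # each tagged with the index of the set it came from.
    events = []
    for idx, intervals in enumerate(sets):
        for start, end in intervals:
            if start < end:  # skip empty or inverted intervals
                events.append((start, 1, idx))
                events.append((end, -1, idx))

    # Sort by position; at equal positions ends (-1) come before starts (+1),
    # so intervals that merely touch do not count as overlapping.
    events.sort(key=lambda e: (e[0], e[1]))

    depth = [0] * n      # open intervals per set at the sweep position
    covered = 0          # number of sets with depth > 0
    region_start = None  # start of the current all-sets region, if any
    out: List[Interval] = []

    for pos, delta, idx in events:
        if delta == 1:
            depth[idx] += 1
            if depth[idx] == 1:
                covered += 1
                if covered == n:
                    region_start = pos
        else:
            depth[idx] -= 1
            if depth[idx] == 0:
                if covered == n:
                    out.append((region_start, pos))
                covered -= 1
    return out


if __name__ == "__main__":
    example = [
        [(1, 10), (20, 30)],
        [(5, 25)],
        [(0, 8), (22, 40)],
    ]
    print(common_regions(example))  # [(5, 8), (22, 25)]
```

This baseline sorts all endpoints together, so it costs O(M log M) for M total intervals and runs on a single thread. It is meant only as a reference point for the problem the article addresses.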