A Parallel Algorithm for N-Way Interval Set Intersection

The comparison of sets of genome intervals (e.g., genes, repeats, ChIP-seq peaks) is essential to genome research, especially as modern sequencing technologies enable ever larger and more complex experiments. Relationships between genomic features are commonly identified by their intersection: that...

Ausführliche Beschreibung

Gespeichert in:
Bibliographische Detailangaben
Veröffentlicht in:Proceedings of the IEEE Jg. 105; H. 3; S. 542 - 551
Hauptverfasser: Layer, Ryan M., Quinlan, Aaron R.
Format: Journal Article
Sprache:Englisch
Veröffentlicht: United States IEEE 01.03.2017
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Schlagworte:
ISSN:0018-9219, 1558-2256
Online-Zugang:Volltext
Tags: Tag hinzufügen
Keine Tags, Fügen Sie den ersten Tag hinzu!
Beschreibung
Zusammenfassung:The comparison of sets of genome intervals (e.g., genes, repeats, ChIP-seq peaks) is essential to genome research, especially as modern sequencing technologies enable ever larger and more complex experiments. Relationships between genomic features are commonly identified by their intersection: that is, if feature sets contain overlapping intervals then it is inferred that they share a common biological function or origin. Using this technique, researchers identify genomic regions that are common among multiple (or unique to individuals) data sets. While there have been recent advances in algorithms for pairwise intersections between two sets of genomic intervals, few advances have been made to the intersection of many sets of genomic intervals. Identifying intersections among many interval sets is particularly important when attempting to distill biological insights from the massive, multidimensional data sets that are common to modern genome research. For such analyses, speed and efficiency are crucial, given the size and sheer number of data sets involved. To solve this problem, we present a novel "slice-then-sweep" algorithm that, given N interval sets, efficiently reveals the subset of intervals that are common to all N sets. We demonstrate that our algorithm is more efficient in the sequential case and has a vastly higher capacity for parallelization with a 19x speedup over the existing algorithm.
Bibliographie:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
ISSN:0018-9219
1558-2256
DOI:10.1109/JPROC.2015.2461494