SciHadoop: Array-based query processing in Hadoop

Bibliographic details
Published in: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-11
Main authors: Buck, Joe B.; Watkins, Noah; LeFevre, Jeff; Ioannidou, Kleoni; Maltzahn, Carlos; Polyzotis, Neoklis; Brandt, Scott
Format: Conference paper
Language: English
Publication details: New York, NY, USA: ACM, 12 November 2011
IEEE
Series: ACM Conferences
ISBN: 145030771X, 9781450307710
ISSN: 2167-4329
Description
Summary: Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions of I/O, both locally and over the network.
DOI: 10.1145/2063384.2063473
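
The summary mentions evaluating holistic aggregates during the map phase and pruning input partitions based on a query's data dependencies. The following is a minimal sketch of those two ideas only, not the authors' implementation and not the SciHadoop API. It assumes a small in-memory 2-D array as a stand-in for a NetCDF variable, a median query over non-overlapping tiles, and a queried row range that falls on partition boundaries. All function names (`make_array`, `aligned_partitions`, `prune`, `map_tile_medians`) are hypothetical.

```python
from statistics import median


def make_array(rows, cols):
    """Build a small 2-D array with deterministic values (stand-in for array data)."""
    return [[r * cols + c for c in range(cols)] for r in range(rows)]


def aligned_partitions(rows, tile_rows, tiles_per_partition):
    """Split the row range into partitions whose boundaries fall on tile boundaries.

    Because no tile straddles two partitions, each map task holds every value
    of the tiles it owns.
    """
    step = tile_rows * tiles_per_partition
    return [(start, min(start + step, rows)) for start in range(0, rows, step)]


def prune(partitions, query_rows):
    """Keep only partitions that intersect the queried row range [lo, hi)."""
    lo, hi = query_rows
    return [(s, e) for (s, e) in partitions if s < hi and e > lo]


def map_tile_medians(array, partition, tile_rows, tile_cols):
    """Map task: emit {tile_key: median} for every tile inside the partition.

    The median is holistic (it needs all values of a tile at once), but since
    the partition contains whole tiles it can be finished here, with no
    per-value shuffle to a reducer.
    """
    start, end = partition
    cols = len(array[0])
    out = {}
    for r0 in range(start, end, tile_rows):
        for c0 in range(0, cols, tile_cols):
            values = [
                array[r][c]
                for r in range(r0, min(r0 + tile_rows, end))
                for c in range(c0, min(c0 + tile_cols, cols))
            ]
            out[(r0 // tile_rows, c0 // tile_cols)] = median(values)
    return out


if __name__ == "__main__":
    array = make_array(8, 4)
    partitions = aligned_partitions(rows=8, tile_rows=2, tiles_per_partition=2)
    selected = prune(partitions, query_rows=(0, 4))  # second partition is never read
    results = {}
    for part in selected:
        results.update(map_tile_medians(array, part, tile_rows=2, tile_cols=2))
    print(results)
```

In this sketch the pruning step works purely on logical coordinates, which is the point of defining queries over the array model rather than over byte offsets; how the real system maps logical partitions to physical blocks is not shown here.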