SciHadoop: Array-based query processing in Hadoop


Bibliographic Details
Published in:	2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-11
Main Authors:	Buck, Joe B.; Watkins, Noah; LeFevre, Jeff; Ioannidou, Kleoni; Maltzahn, Carlos; Polyzotis, Neoklis; Brandt, Scott
Format:	Conference Paper
Language:	English
Published:	New York, NY, USA: ACM, 12 November 2011; IEEE
Series:	ACM Conferences
Subjects:	Arrays; Data analysis; Data intensive; Data models; Human-centered computing > Visualization > Visualization application domains > Scientific visualization; Information systems > Data management systems > Database management system engines > Database query processing; Information systems > Information retrieval > Search engine architectures and scalability > Distributed retrieval; Information systems > Information retrieval > Search engine architectures and scalability > Peer-to-peer retrieval; Information systems > Information storage systems > Storage architectures > Distributed storage; Information systems > Information systems applications; Layout; Libraries; Map reduce; Optimization; Query optimization; Scientific file-formats; Semantics; Theory of computation > Theory and algorithms for application domains > Database theory > Database query processing and optimization (theory)
ISBN:	145030771X, 9781450307710
ISSN:	2167-4329
Online Access:	Get full text

Description
Summary:	Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions of I/O, both locally and over the network.
DOI:	10.1145/2063384.2063473