SciHadoop: array-based query processing in Hadoop

Bibliographic details
Published in: 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-11
Main authors: Buck, Joe B., Watkins, Noah, LeFevre, Jeff, Ioannidou, Kleoni, Maltzahn, Carlos, Polyzotis, Neoklis, Brandt, Scott
Format: Conference proceeding
Language: English
Published: New York, NY, USA: ACM, 12 November 2011
IEEE
Series: ACM Conferences
Subjects:
ISBN: 145030771X, 9781450307710
ISSN: 2167-4329
Online access: Full text
Abstract Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly structured, array-based binary file formats, resulting in limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin allowing scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions of I/O, both locally and over the network.
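The partition-pruning idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say how data dependencies are examined, so the sketch assumes the simplest case, in which a query reads one rectangular region of the logical array and a partition can be skipped when it does not overlap that region. The names `Box`, `intersects`, and `prune_partitions`, and the example array layout, are hypothetical.

```python
from typing import List, Tuple

# A hyper-rectangle in logical array coordinates: (corner, shape),
# with one entry per dimension. Shapes are assumed to be positive.
Box = Tuple[Tuple[int, ...], Tuple[int, ...]]


def intersects(a: Box, b: Box) -> bool:
    """Return True if the two boxes overlap in every dimension."""
    (a_corner, a_shape), (b_corner, b_shape) = a, b
    return all(
        ac < bc + bs and bc < ac + a_len
        for ac, a_len, bc, bs in zip(a_corner, a_shape, b_corner, b_shape)
    )


def prune_partitions(query: Box, partitions: List[Box]) -> List[Box]:
    """Keep only the input partitions that the query region touches."""
    return [p for p in partitions if intersects(query, p)]


# Hypothetical 100x100 array split into four 50x50 partitions.
partitions = [
    ((0, 0), (50, 50)),
    ((0, 50), (50, 50)),
    ((50, 0), (50, 50)),
    ((50, 50), (50, 50)),
]

# Query region: rows 10..29, columns 10..69.
query = ((10, 10), (20, 60))

# Only the two partitions covering rows 0..49 overlap the query.
selected = prune_partitions(query, partitions)
```

In this example the two partitions that cover rows 50..99 are never read. A real system would also have to map logical partitions to physical blocks of the binary file, which this sketch leaves out.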
Author LeFevre, Jeff
Ioannidou, Kleoni
Buck, Joe B.
Maltzahn, Carlos
Brandt, Scott
Polyzotis, Neoklis
Watkins, Noah
Author_xml – sequence: 1
  givenname: Joe B.
  surname: Buck
  fullname: Buck, Joe B.
  email: buck@cs.ucsc.edu
  organization: UC Santa Cruz
– sequence: 2
  givenname: Noah
  surname: Watkins
  fullname: Watkins, Noah
  email: jayhawk@cs.ucsc.edu
  organization: UC Santa Cruz
– sequence: 3
  givenname: Jeff
  surname: LeFevre
  fullname: LeFevre, Jeff
  email: jlefevre@cs.ucsc.edu
  organization: UC Santa Cruz
– sequence: 4
  givenname: Kleoni
  surname: Ioannidou
  fullname: Ioannidou, Kleoni
  email: kleoni@cs.ucsc.edu
  organization: UC Santa Cruz
– sequence: 5
  givenname: Carlos
  surname: Maltzahn
  fullname: Maltzahn, Carlos
  email: carlosm@cs.ucsc.edu
  organization: UC Santa Cruz
– sequence: 6
  givenname: Neoklis
  surname: Polyzotis
  fullname: Polyzotis, Neoklis
  email: alkis@cs.ucsc.edu
  organization: UC Santa Cruz
– sequence: 7
  givenname: Scott
  surname: Brandt
  fullname: Brandt, Scott
  email: scott@cs.ucsc.edu
  organization: UC Santa Cruz
ContentType Conference Proceeding
Copyright 2011 ACM
Copyright_xml – notice: 2011 ACM
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1145/2063384.2063473
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE/IET Electronic Library
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList

Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library (IEL) (UW System Shared)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Computer Science
EISBN 145030771X
9781450307710
EndPage 11
ExternalDocumentID 6114451
Genre orig-research
GroupedDBID 6IE
6IF
6IK
6IL
6IN
AAJGR
ACM
ADPZR
ALMA_UNASSIGNED_HOLDINGS
APO
BEFXN
BFFAM
BGNUA
BKEBE
BPEOZ
CBEJK
GUFHI
IEGSK
IERZE
OCL
RIB
RIC
RIE
RIL
6IH
AAWTH
ABLEC
ADZIZ
CHZPO
IPLJI
ID FETCH-LOGICAL-a162t-aa91d3c37bcf420bfd7d6c1452efc4ee9d7a61b0e1ff9faa29061c25e8aff3ae3
IEDL.DBID RIE
ISBN 145030771X
9781450307710
ISSN 2167-4329
IngestDate Wed Aug 27 03:18:43 EDT 2025
Wed Jan 31 06:47:53 EST 2024
IsPeerReviewed false
IsScholarly false
Keywords query optimization
scientific file-formats
data intensive
map reduce
Language English
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org
LinkModel DirectLink
MeetingName SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis
MergedId FETCHMERGED-LOGICAL-a162t-aa91d3c37bcf420bfd7d6c1452efc4ee9d7a61b0e1ff9faa29061c25e8aff3ae3
PageCount 11
ParticipantIDs acm_books_10_1145_2063384_2063473_brief
acm_books_10_1145_2063384_2063473
ieee_primary_6114451
PublicationCentury 2000
PublicationDate 20111112
2011-Nov.
PublicationDateYYYYMMDD 2011-11-12
2011-11-01
PublicationDate_xml – month: 11
  year: 2011
  text: 20111112
  day: 12
PublicationDecade 2010
PublicationPlace New York, NY, USA
PublicationPlace_xml – name: New York, NY, USA
PublicationSeriesTitle ACM Conferences
PublicationTitle 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC)
PublicationTitleAbbrev SC
PublicationYear 2011
Publisher ACM
IEEE
Publisher_xml – name: ACM
– name: IEEE
SSID ssj0001125516
ssj0003204180
Score 1.8230875
SourceID ieee
acm
SourceType Publisher
StartPage 1
SubjectTerms Arrays
Data analysis
Data intensive
Data models
Human-centered computing -- Visualization -- Visualization application domains -- Scientific visualization
Information systems -- Data management systems -- Database management system engines -- Database query processing
Information systems -- Information retrieval -- Search engine architectures and scalability -- Distributed retrieval
Information systems -- Information retrieval -- Search engine architectures and scalability -- Peer-to-peer retrieval
Information systems -- Information storage systems -- Storage architectures -- Distributed storage
Information systems -- Information systems applications
Layout
Libraries
map reduce
Optimization
query optimization
scientific file-formats
Semantics
Theory of computation -- Theory and algorithms for application domains -- Database theory -- Database query processing and optimization (theory)
Subtitle array-based query processing in Hadoop
Title SciHadoop
URI https://ieeexplore.ieee.org/document/6114451
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
linkProvider IEEE