SciHadoop: array-based query processing in Hadoop
Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte stream data model causes inefficiencies when used to process scientific data that is commonly stored in highly-structured, arr...
| Published in: | 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC), pp. 1-11 |
|---|---|
| Main authors: | Joe B. Buck, Noah Watkins, Jeff LeFevre, Kleoni Ioannidou, Carlos Maltzahn, Neoklis Polyzotis, Scott Brandt |
| Format: | Conference proceeding |
| Language: | English |
| Published: | New York, NY, USA: ACM / IEEE, 12 November 2011 |
| Series: | ACM Conferences |
| Subjects: | Human-centered computing > Visualization > Visualization application domains > Scientific visualization; Information systems > Data management systems > Database management system engines > Database query processing; Information systems > Information retrieval > Search engine architectures and scalability > Distributed retrieval; Information systems > Information retrieval > Search engine architectures and scalability > Peer-to-peer retrieval |
| ISBN: | 145030771X, 9781450307710 |
| ISSN: | 2167-4329 |
| Online access: | Full text |
| Abstract | Hadoop has become the de facto platform for large-scale data analysis in commercial applications, and increasingly so in scientific applications. However, Hadoop's byte-stream data model causes inefficiencies when it is used to process scientific data, which is commonly stored in highly structured, array-based binary file formats. The result is limited scalability of Hadoop applications in science. We introduce SciHadoop, a Hadoop plugin that allows scientists to specify logical queries over array-based data models. SciHadoop executes queries as map/reduce programs defined over the logical data model. We describe the implementation of a SciHadoop prototype for NetCDF data sets and quantify the performance of five separate optimizations that address the following goals for several representative aggregate queries: reduce total data transfers, reduce remote reads, and reduce unnecessary reads. Two optimizations allow holistic aggregate queries to be evaluated opportunistically during the map phase; two additional optimizations intelligently partition input data to increase read locality; and one optimization avoids block scans by examining the data dependencies of an executing query to prune input partitions. Experiments involving a holistic function show run-time improvements of up to 8x, with drastic reductions in I/O, both locally and over the network. |
|---|---|
| Author | LeFevre, Jeff Ioannidou, Kleoni Buck, Joe B. Maltzahn, Carlos Brandt, Scott Polyzotis, Neoklis Watkins, Noah |
| Author_xml | – sequence: 1 givenname: Joe B. surname: Buck fullname: Buck, Joe B. email: buck@cs.ucsc.edu organization: UC Santa Cruz – sequence: 2 givenname: Noah surname: Watkins fullname: Watkins, Noah email: jayhawk@cs.ucsc.edu organization: UC Santa Cruz – sequence: 3 givenname: Jeff surname: LeFevre fullname: LeFevre, Jeff email: jlefevre@cs.ucsc.edu organization: UC Santa Cruz – sequence: 4 givenname: Kleoni surname: Ioannidou fullname: Ioannidou, Kleoni email: kleoni@cs.ucsc.edu organization: UC Santa Cruz – sequence: 5 givenname: Carlos surname: Maltzahn fullname: Maltzahn, Carlos email: carlosm@cs.ucsc.edu organization: UC Santa Cruz – sequence: 6 givenname: Neoklis surname: Polyzotis fullname: Polyzotis, Neoklis email: alkis@cs.ucsc.edu organization: UC Santa Cruz – sequence: 7 givenname: Scott surname: Brandt fullname: Brandt, Scott email: scott@cs.ucsc.edu organization: UC Santa Cruz |
| ContentType | Conference Proceeding |
| Copyright | 2011 ACM |
| Copyright_xml | – notice: 2011 ACM |
| DBID | 6IE 6IL CBEJK RIE RIL |
| DOI | 10.1145/2063384.2063473 |
| DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE/IET Electronic Library IEEE Proceedings Order Plans (POP All) 1998-Present |
| DatabaseTitleList | |
| Database_xml | – sequence: 1 dbid: RIE name: IEEE/IET Electronic Library (IEL) (UW System Shared) url: https://ieeexplore.ieee.org/ sourceTypes: Publisher |
| DeliveryMethod | fulltext_linktorsrc |
| Discipline | Computer Science |
| EISBN | 145030771X 9781450307710 |
| EndPage | 11 |
| ExternalDocumentID | 6114451 |
| Genre | orig-research |
| IEDL.DBID | RIE |
| ISBN | 145030771X 9781450307710 |
| ISSN | 2167-4329 |
| IngestDate | Wed Aug 27 03:18:43 EDT 2025 Wed Jan 31 06:47:53 EST 2024 |
| IsPeerReviewed | false |
| IsScholarly | false |
| Keywords | query optimization; scientific file-formats; data intensive; map reduce |
| Language | English |
| License | Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org |
| LinkModel | DirectLink |
| MeetingName | SC '11: International Conference for High Performance Computing, Networking, Storage and Analysis |
| PageCount | 11 |
| ParticipantIDs | acm_books_10_1145_2063384_2063473_brief acm_books_10_1145_2063384_2063473 ieee_primary_6114451 |
| PublicationCentury | 2000 |
| PublicationDate | 20111112 2011-Nov. |
| PublicationDateYYYYMMDD | 2011-11-12 2011-11-01 |
| PublicationDate_xml | – month: 11 year: 2011 text: 20111112 day: 12 |
| PublicationDecade | 2010 |
| PublicationPlace | New York, NY, USA |
| PublicationPlace_xml | – name: New York, NY, USA |
| PublicationSeriesTitle | ACM Conferences |
| PublicationTitle | 2011 International Conference for High Performance Computing, Networking, Storage and Analysis (SC) |
| PublicationTitleAbbrev | SC |
| PublicationYear | 2011 |
| Publisher | ACM IEEE |
| Publisher_xml | – name: ACM – name: IEEE |
| SSID | ssj0001125516 ssj0003204180 |
| Score | 1.8230875 |
| SourceID | ieee acm |
| SourceType | Publisher |
| StartPage | 1 |
| SubjectTerms | Arrays; Data analysis; Data intensive; Data models; Human-centered computing -- Visualization -- Visualization application domains -- Scientific visualization; Information systems -- Data management systems -- Database management system engines -- Database query processing; Information systems -- Information retrieval -- Search engine architectures and scalability -- Distributed retrieval; Information systems -- Information retrieval -- Search engine architectures and scalability -- Peer-to-peer retrieval; Information systems -- Information storage systems -- Storage architectures -- Distributed storage; Information systems -- Information systems applications; Layout; Libraries; map reduce; Optimization; query optimization; scientific file-formats; Semantics; Theory of computation -- Theory and algorithms for application domains -- Database theory -- Database query processing and optimization (theory) |
| Subtitle | array-based query processing in Hadoop |
| Title | SciHadoop |
| URI | https://ieeexplore.ieee.org/document/6114451 |
| hasFullText | 1 |
| inHoldings | 1 |

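The abstract mentions an optimization that prunes input partitions by examining the data dependencies of a query. The sketch below illustrates that general idea only; it is not SciHadoop's implementation. It assumes that a query and each input partition can be described as a hyper-rectangle (a corner plus a shape) in logical array coordinates. The names `Box`, `intersects`, and `prune_partitions` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

# Hypothetical representation: a box is (corner, shape), an n-dimensional
# hyper-rectangle in logical array coordinates.
Box = Tuple[Tuple[int, ...], Tuple[int, ...]]


def intersects(a: Box, b: Box) -> bool:
    """Return True if boxes a and b overlap in every dimension."""
    (a_corner, a_shape), (b_corner, b_shape) = a, b
    return all(
        ac < bc + bs and bc < ac + a_len
        for ac, a_len, bc, bs in zip(a_corner, a_shape, b_corner, b_shape)
    )


def prune_partitions(query: Box, partitions: List[Box]) -> List[Box]:
    """Keep only the partitions that the query actually depends on."""
    return [p for p in partitions if intersects(query, p)]


# A 20x20 array split into four 10x10 partitions (illustrative values).
partitions = [
    ((0, 0), (10, 10)),
    ((0, 10), (10, 10)),
    ((10, 0), (10, 10)),
    ((10, 10), (10, 10)),
]

# A query over rows 2..5 and columns 3..7 touches only the first partition.
kept = prune_partitions(((2, 3), (4, 5)), partitions)
```

Under these assumptions, partitions that do not overlap the query region never need to be read, which is the kind of unnecessary read the abstract describes avoiding.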
