A comparison of approaches to large-scale data analysis

Gespeichert in:
Bibliographische Detailangaben
Titel: A comparison of approaches to large-scale data analysis
Autoren: Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel J. Abadi, David J. Dewitt, Samuel Madden, Michael Stonebraker
Weitere Verfasser: The Pennsylvania State University CiteSeerX Archives
Quelle: http://www.cs.ucf.edu/%7Ekienhua/classes/COP5711/Papers/Large-Scale%20Data%20Analysis-sigmod09.pdf.
Publikationsjahr: 2009
Bestand: CiteSeerX
Schlagwörter: General Terms Database Applications, Use Cases, Database Programming
Beschreibung: There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.
Publikationsart: text
Dateibeschreibung: application/pdf
Sprache: English
Relation: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.188.3094
Verfügbarkeit: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.188.3094
http://www.cs.ucf.edu/%7Ekienhua/classes/COP5711/Papers/Large-Scale%20Data%20Analysis-sigmod09.pdf
Rights: Metadata may be used without restrictions as long as the oai identifier remains attached to it.
Dokumentencode: edsbas.7848F4CC
Datenbank: BASE
Beschreibung
Abstract:There is currently considerable enthusiasm around the MapReduce (MR) paradigm for large-scale data analysis [17]. Although the basic control flow of this framework has existed in parallel SQL database management systems (DBMS) for over 20 years, some have called MR a dramatically new computing model [8, 17]. In this paper, we describe and compare both paradigms. Furthermore, we evaluate both kinds of systems in terms of performance and development complexity. To this end, we define a benchmark consisting of a collection of tasks that we have run on an open source version of MR as well as on two parallel DBMSs. For each task, we measure each system’s performance for various degrees of parallelism on a cluster of 100 nodes. Our results reveal some interesting trade-offs. Although the process to load data into and tune the execution of parallel DBMSs took much longer than the MR system, the observed performance of these DBMSs was strikingly better. We speculate about the causes of the dramatic performance difference and consider implementation concepts that future systems should take from both kinds of architectures.