A general and fast distributed system for large-scale dynamic programming applications


Bibliographic details
Published in: Parallel Computing, Vol. 60, pp. 1-21
Main authors: Wang, Chen; Yu, Ce; Tang, Shanjiang; Xiao, Jian; Sun, Jizhou; Meng, Xiangfei
Format: Journal Article
Language: English
Published: Elsevier B.V., 1 December 2016
ISSN:0167-8191, 1872-7336
Online access: Full text
Description
Abstract:
• Propose a DAG-based parallel DP programming model, consisting of a DAG pattern library and a vertex compute method.
• Develop a general and fast distributed framework called DPX10 based on the proposed programming model.
• Propose and implement a new comprehensive straggler mitigation strategy in DPX10 for DP computation.
• Demonstrate and experimentally validate the simplicity, efficiency, and scalability of DPX10 with four large-scale DP applications.

Dynamic programming (DP) is an important technique widely used in many scientific applications. Because of the massive volume of application data in practice, parallel and distributed DP is a must. However, writing a parallel and distributed DP program is difficult and error-prone because of its intrinsically strong data dependencies. In this paper, we present DPX10, a DAG-based distributed X10 framework that aims to simplify parallel programming for DP applications. DPX10 enables users to write highly efficient parallel DP programs without much effort. To program with DPX10, users only need to do two things: 1) instantiate a DAG pattern by indicating the dependencies between vertices of the DAG; 2) implement the DP application's logic in the compute method of the vertices. DPX10 provides eight commonly used DAG patterns and a simple API that allows users to customize their own DAG patterns. All the tiresome work of DP parallelization, including DAG distribution, task scheduling, and task communication, is hidden from users and handled by DPX10. Moreover, DPX10 is fault-tolerant and has a mechanism to handle straggler tasks, which run much slower than other tasks because of unexpected resource contention. Finally, we use four DP applications with up to 2 billion vertices running on 240 cores to demonstrate the simplicity, efficiency, and scalability of the proposed framework.
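The two-step programming model described in the abstract (choose a DAG pattern, then write a per-vertex compute method) can be illustrated with a small single-process sketch. This is not the DPX10 API, which is written in X10 and is not reproduced in this record; the names `grid_dependencies`, `run_dag`, and `lcs_length` are hypothetical and chosen only for illustration. The sketch assumes a 2D grid pattern in which each cell depends on its left, upper, and upper-left neighbours, and uses the longest common subsequence problem as the DP application. Distribution, fault tolerance, and straggler handling are deliberately left out.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

Vertex = Tuple[int, int]


def grid_dependencies(i: int, j: int) -> List[Vertex]:
    """Hypothetical DAG pattern: cell (i, j) depends on its upper, left,
    and upper-left neighbours (those that exist inside the grid)."""
    candidates = [(i - 1, j), (i, j - 1), (i - 1, j - 1)]
    return [(a, b) for a, b in candidates if a >= 0 and b >= 0]


def run_dag(
    height: int,
    width: int,
    dependencies: Callable[[int, int], List[Vertex]],
    compute: Callable[[Vertex, Dict[Vertex, int]], int],
) -> Dict[Vertex, int]:
    """Sequential stand-in for a scheduler: a vertex is computed only after
    all of its dependencies have finished (topological order)."""
    vertices = [(i, j) for i in range(height) for j in range(width)]
    deps = {v: dependencies(*v) for v in vertices}
    pending = {v: len(deps[v]) for v in vertices}
    dependents: Dict[Vertex, List[Vertex]] = {v: [] for v in vertices}
    for v, ds in deps.items():
        for d in ds:
            dependents[d].append(v)

    ready = deque(v for v in vertices if pending[v] == 0)
    results: Dict[Vertex, int] = {}
    while ready:
        v = ready.popleft()
        # The user-supplied compute method sees only its dependencies' results.
        results[v] = compute(v, {d: results[d] for d in deps[v]})
        for w in dependents[v]:
            pending[w] -= 1
            if pending[w] == 0:
                ready.append(w)
    return results


def lcs_length(a: str, b: str) -> int:
    """DP application logic: longest common subsequence length."""

    def compute(v: Vertex, dep_values: Dict[Vertex, int]) -> int:
        i, j = v
        if i == 0 or j == 0:  # empty prefix: no common subsequence
            return 0
        if a[i - 1] == b[j - 1]:
            return dep_values[(i - 1, j - 1)] + 1
        return max(dep_values[(i - 1, j)], dep_values[(i, j - 1)])

    table = run_dag(len(a) + 1, len(b) + 1, grid_dependencies, compute)
    return table[(len(a), len(b))]
```

For example, `lcs_length("AGGTAB", "GXTXAYB")` evaluates the 7 x 8 grid of vertices and returns the value stored at the last vertex. In a distributed setting the `ready` queue would be replaced by a scheduler that assigns ready vertices to workers, which is the part the abstract says the framework hides from the user.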
DOI:10.1016/j.parco.2016.09.002