Runtime-guided management of stacked DRAM memories in task parallel programs

Bibliographic Details
Title: Runtime-guided management of stacked DRAM memories in task parallel programs
Authors: Álvarez Martí, Lluc, Casas, Marc, Labarta Mancho, Jesús José, Ayguadé Parra, Eduard, Valero Cortés, Mateo, Moretó Planas, Miquel
Contributors: Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors, Barcelona Supercomputing Center, Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Publisher Information: Association for Computing Machinery (ACM)
Publication Year: 2018
Collection: Universitat Politècnica de Catalunya, BarcelonaTech: UPCommons - Global access to UPC knowledge
Subject Terms: Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors, High performance computing, Memory management (Computer science), Runtime systems, Stacked DRAM memories, Task-based data-flow programming models, Data flow analysis, Hardware, Intelligent control, Memory architecture, Application codes, Complex hardware, Dataflow programming, DRAM memory, HPC, Runtime approach, Task-based programming, Dynamic random access storage, Càlcul intensiu (Informàtica), Gestió de memòria (Informàtica)
Description: Stacked DRAM memories have become a reality in High-Performance Computing (HPC) architectures. These memories provide much higher bandwidth while consuming less power than traditional off-chip memories, but their limited memory capacity is insufficient for modern HPC systems. For this reason, both stacked DRAM and off-chip memories are expected to co-exist in HPC architectures, giving rise to different approaches for architecting the stacked DRAM in the system. This paper proposes a runtime approach to transparently manage stacked DRAM memories in task-based programming models. In this approach, the runtime system is in charge of copying the data accessed by the tasks to the stacked DRAM, without requiring complex hardware support or modifications to the application code. To mitigate the cost of copying data between the stacked DRAM and the off-chip memory, the proposal includes an optimization that parallelizes the copies across idle or additional helper threads. In addition, the runtime system is aware of the reuse pattern of the data accessed by the tasks, and can exploit this information to avoid copying data to the stacked DRAM when the copy is not worthwhile. Results on the Intel Knights Landing processor show that the proposed techniques achieve an average speedup of 14% over the state-of-the-art library for managing the stacked DRAM and 29% over a stacked DRAM architected as a hardware cache. ; This work has been supported by the RoMoL ERC Advanced Grant (GA 321253), by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), by the Generalitat de Catalunya (contracts 2014-SGR-1051 and 2014-SGR-1272) and by the European Union’s Horizon 2020 research and innovation programme (grant agreement 779877). M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104.
; Peer Reviewed ; Postprint (author's final draft)
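Illustrative note: the description names two runtime techniques, reuse-aware selection of the data to copy and parallel copies across helper threads. The Python sketch below is a rough model of both ideas and is not the paper's implementation. The function names `plan_copies` and `parallel_copy`, the `min_reuse` threshold, and the greedy most-reused-first policy are all assumptions made for illustration; the record does not state how the actual runtime makes these decisions.

```python
from concurrent.futures import ThreadPoolExecutor


def plan_copies(tasks, sizes, capacity, min_reuse=2):
    """Choose which data regions to copy to the fast (stacked) memory.

    tasks:    list of lists; each inner list names the regions one task accesses.
    sizes:    dict mapping region name -> size in bytes.
    capacity: bytes available in the fast memory.
    Hypothetical policy: skip regions accessed by fewer than `min_reuse`
    tasks, then greedily keep the most reused regions that still fit.
    """
    reuse = {}
    for accessed in tasks:
        for region in set(accessed):  # count each task once per region
            reuse[region] = reuse.get(region, 0) + 1

    chosen, used = [], 0
    for region in sorted(reuse, key=lambda r: (-reuse[r], r)):
        if reuse[region] < min_reuse:
            continue  # too little reuse to pay for the copy
        if used + sizes[region] <= capacity:
            chosen.append(region)
            used += sizes[region]
    return chosen


def parallel_copy(src, dst, workers=4):
    """Copy `src` into `dst` (same length) in chunks, one chunk per helper task."""
    n = len(src)
    chunk = max(1, -(-n // workers))  # ceiling division, at least 1

    def copy_chunk(start):
        end = min(start + chunk, n)
        dst[start:end] = src[start:end]

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_chunk, range(0, n, chunk)))


if __name__ == "__main__":
    tasks = [["A", "B"], ["A", "C"], ["A", "B"]]
    sizes = {"A": 4, "B": 4, "C": 4}
    print(plan_copies(tasks, sizes, capacity=8))
    src = bytes(range(10))
    dst = bytearray(len(src))
    parallel_copy(src, dst, workers=3)
    print(dst == src)
```

In this sketch the selection step runs before any copy is issued, so a region that is read only once never pays the transfer cost. A real task-based runtime would derive the reuse counts from the dependences declared on the tasks rather than from a precomputed list.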
Document Type: conference object
File Description: 11 p.; application/pdf
Language: English
Relation: https://dl.acm.org/citation.cfm?id=3205312; info:eu-repo/grantAgreement/AGAUR/V PRI/2014 SGR 1272; info:eu-repo/grantAgreement/AGAUR/V PRI/2014 SGR 1051; info:eu-repo/grantAgreement/MINECO//TIN2015-65316-P/ES/COMPUTACION DE ALTAS PRESTACIONES VII/; info:eu-repo/grantAgreement/AEI/RYC-2016-21104; info:eu-repo/grantAgreement/EC/FP7/321253/EU/Riding on Moore's Law/ROMOL; info:eu-repo/grantAgreement/EC/H2020/779877/EU/Mont-Blanc 2020, European scalable, modular and power efficient HPC processor/Mont-Blanc 2020; info:eu-repo/grantAgreement/MINECO/PE2013-2016/RYC-2016-21104; https://hdl.handle.net/2117/125344
DOI: 10.1145/3205289.3205312
Availability: https://hdl.handle.net/2117/125344
https://doi.org/10.1145/3205289.3205312
Rights: Open Access
Accession Number: edsbas.B5474C53
Database: BASE