Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters

Saved in:
Detailed bibliography
Title: Tightly-coupled hardware support to dynamic parallelism acceleration in embedded shared memory clusters
Authors: Paolo Burgio, Giuseppe Tagliavini, Francesco Conti, Andrea Marongiu, Luca Benini
Source: Design, Automation & Test in Europe Conference & Exhibition (DATE), 2014, pp. 1-6
Publisher information: IEEE Conference Publications, 2014.
Publication year: 2014
Topics: Engineering controlled terms: Application programming interfaces (API); Benchmarking; Computer programming; Design; Embedded systems; Hardware; Parallel architectures; Semantics; Cluster-based architecture; Fine-grained parallelism; Hardware implementations; Modern applications; Multi-core cluster; Programming abstractions; Runtime environments; Synthetic benchmark. Engineering main heading: Scheduling. Classification: 0202 electrical engineering, electronic engineering, information engineering; 02 engineering and technology
Description: Modern designs for embedded systems are increasingly embracing cluster-based architectures, where small sets of cores communicate through tightly-coupled shared memory banks and high-performance interconnections. At the same time, the complexity of modern applications requires new programming abstractions to exploit dynamic and/or irregular parallelism on such platforms. Supporting dynamic parallelism in systems which i) are resource-constrained and ii) run applications with small units of work calls for a runtime environment which has minimal overhead for the scheduling of parallel tasks. In this work, we study the major sources of overhead in the implementation of OpenMP dynamic loops, sections and tasks, and propose a hardware implementation of a generic Scheduling Engine (HWSE) which fits the semantics of the three constructs. The HWSE is designed as a tightly-coupled block to the PEs within a multi-core cluster, communicating through a shared-memory interface. This allows very fast programming and synchronization with the controlling PEs, fundamental to achieving fast dynamic scheduling, and ultimately to enable fine-grained parallelism. We prove the effectiveness of our solutions with real applications and synthetic benchmarks, using a cycle-accurate virtual platform.
Document type: Article; Conference object
File format: application/pdf
DOI: 10.7873/date.2014.169
DOI: 10.5555/2616606.2616798
Access URLs: https://cris.unibo.it/handle/11585/525125
https://doi.org/10.7873/DATE.2014.169
https://ieeexplore.ieee.org/document/6800370/
https://dl.acm.org/doi/10.5555/2616606.2616798
https://dblp.uni-trier.de/db/conf/date/date2014.html#BurgioTCMB14
https://hdl.handle.net/11380/1171893
Accession number: edsair.doi.dedup.....3c0ac4736dab3c1e7df54e4cdec29a0c
Database: OpenAIRE