Parallel programming models for heterogeneous many-cores: a comprehensive survey

Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high performance, such potential can only be unlocked if the application programs are suitabl...

Bibliographic details
Published in:	CCF transactions on high performance computing (Online) Volume 2; Issue 4; pp. 382-400
Main authors:	Fang, Jianbin; Huang, Chun; Tang, Tao; Wang, Zheng
Format:	Journal Article
Language:	English
Publication details:	Singapore Springer Singapore 01.12.2020 Springer Nature B.V
Subjects:	Algorithms; Assembly language; Communication; Computation; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Linear algebra; Machine learning; Parallel programming; Programmers; Software; Supercomputers; Survey Paper; Heterogeneous computing; Parallel programming models; Many-core architectures
ISSN:	2524-4922, 2524-4930

Abstract	Heterogeneous many-cores are now an integral part of modern computing systems ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high performance, such potential can only be unlocked if the application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey of parallel programming models for heterogeneous many-core architectures and review the compilation techniques for improving programmability and portability. We examine various software optimization techniques for minimizing the communication overhead between heterogeneous computing devices. We provide a road map for a wide variety of research areas. We conclude with a discussion of open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.
Author	Fang, Jianbin; Tang, Tao; Huang, Chun; Wang, Zheng
Author_xml	1. Fang, Jianbin (Institute for Computer Systems, College of Computer, National University of Defense Technology); 2. Huang, Chun, chunhuang@nudt.edu.cn (Institute for Computer Systems, College of Computer, National University of Defense Technology); 3. Tang, Tao (Institute for Computer Systems, College of Computer, National University of Defense Technology); 4. Wang, Zheng (School of Computing, University of Leeds)
ContentType	Journal Article
Copyright	China Computer Federation (CCF) 2020
DOI	10.1007/s42514-020-00039-4
Discipline	Computer Science
EISSN	2524-4930
EndPage	400
ISICitedReferencesCount	36
ISSN	2524-4922
IsPeerReviewed	true
IsScholarly	true
Issue	4
Keywords	Heterogeneous computing Parallel programming models Many-core architectures
Language	English
PageCount	19
PublicationCentury	2000
PublicationDate	20201200 2020-12-00 20201201
PublicationDateYYYYMMDD	2020-12-01
PublicationDate_xml	– month: 12 year: 2020 text: 20201200
PublicationDecade	2020
PublicationPlace	Singapore
PublicationPlace_xml	– name: Singapore – name: Beijing
PublicationTitle	CCF transactions on high performance computing (Online)
PublicationTitleAbbrev	CCF Trans. HPC
PublicationYear	2020
Publisher	Springer Singapore Springer Nature B.V
Publisher_xml	– name: Springer Singapore – name: Springer Nature B.V
(2014) – reference: GardnerMKCharacterizing the challenges and evaluating the efficacy of a cuda-to-opencl translatorParallel Comput.20133976978610.1016/j.parco.2013.09.003 – reference: Intel Inc.: hStreams Architecture for MPSS 3.5 (2015) – reference: ChenTCell broadband engine architecture and its first implementation—a performance viewIBM J. Res. Dev.20075155957210.1147/rd.515.0559 – reference: PGI CUDA C/C++ for x86.: https://developer.nvidia.com/pgi-cuda-cc-x86 (2020) – reference: Wang, Z., et al.: Automatic and portable mapping of data parallel programs to opencl for gpu-based heterogeneous systems. ACM TACO (2015) – reference: Balaprakash, P., et al.: Autotuning in high-performance computing applications. In: Proceedings of the IEEE (2018) – reference: Cummins, C., et al.: End-to-end deep learning of optimization heuristics. In: PACT (2017) – reference: ErnstssonASkepu 2: flexible and type-safe skeleton programming for heterogeneous parallel systemsInt. J. Parallel Program.201846628010.1007/s10766-017-0490-5 – reference: HaidlMGorlatchSHigh-level programming for many-cores using C++14 and the STLInt. J. Parallel Program.201846234110.1007/s10766-017-0497-y – reference: ChenDCharacterizing scalability of sparse matrix-vector multiplications on phytium ft-2000+Int. J. Parallel Program.202048809710.1007/s10766-019-00646-x – reference: Yang, C., et al.: O2render: An opencl-to-renderscript translator for porting across various GPUs or CPUs. In: IEEE 10th Symposium on Embedded Systems for Real-time Multimedia, ESTIMedia (2012) – reference: de Carvalho Moreira, W., et al.: Exploring heterogeneous mobile architectures with a high-level programming model. In: 29th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD (2017) – reference: Barker, K.J., et al.: Entering the petaflop era: the architecture and performance of roadrunner. 
In: Proceedings of the ACM/IEEE Conference on High Performance Computing, SC (2008) – reference: Chen, T., et al.: Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274 (2015) – reference: Mark, W.R., et al.: Cg: a system for programming graphics hardware in a c-like language. ACM Trans. Graph (2003) – reference: Ogilvie, W.F., et al.: Fast automatic heuristic construction using active learning. In: LCPC (2014) – reference: Diamos, G.F., et al.: Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In: 19th International Conference on Parallel Architectures and Compilation Techniques, PACT (2010) – reference: Bellens, P., et al.: Cellss: a programming model for the cell BE architecture. In: Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing (2006) – reference: Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pp. 8024–8035 (2019) – reference: O’BrienKSupporting openmp on cellInt. J. Parallel Program.20083628931110.1007/s10766-008-0072-7 – reference: ViñasMExploiting heterogeneous parallelism with the heterogeneous programming libraryJ. Parallel Distrib. Comput.2013731627163810.1016/j.jpdc.2013.07.013 – reference: The OpenACC API specification for parallel programming.: https://www.openacc.org/ (2020) – reference: Govindaraju, N.K., et al.: High performance discrete fourier transforms on graphics processors. In: Proceedings of the ACM/IEEE Conference on High Performance Computing, SC (2008) – reference: KarpRMThe organization of computations for uniform recurrence equationsJ. ACM (JACM)19671456359023460410.1145/321406.321418 – reference: Nvidia’s next generation cuda compute architecture.: Fermi. NVIDIA Corporation, Tech. rep. 
(2009) – reference: Beckingsale, D., et al.: Performance portable C++ programming with RAJA. In: Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammiDng, PPoPP (2019) – reference: The OpenCL Conformance Tests.: https://github.com/KhronosGroup/OpenCL-CTS (2020) – reference: Sycl integrates opencl devices with modern c++.: Tech. Rep. version 1.2.1 revison 6, The Khronos Group (2019) – reference: Arevalo, A., et al.: Programming the cell broadband engine: examples and best practices (2007) – reference: FangJEvaluating multiple streams on heterogeneous platformsParallel Process. Lett.20162641640002358509810.1142/S0129626416400028 – reference: BuckIBrook for GPUs: stream computing on graphics hardwareACM Trans. Graph20042377778610.1145/1015706.1015800 – reference: Intel’s OneAPI.: https://software.intel.com/en-us/oneapi (2020) – reference: Szuppe, J.: Boost.compute: A parallel computing library for C++ based on opencl. In: Proceedings of the 4th International Workshop on OpenCL, IWOCL (2016) – reference: HarveyMJSwan: a tool for porting CUDA programs to openclComput. Phys. Commun.20111821093109910.1016/j.cpc.2010.12.052 – reference: Ren, J., et al.: Proteus: Network-aware web browsing on heterogeneous mobile systems. In: CoNEXT ’18 (2018) – reference: Wen, Y., et al.: Smart multi-task scheduling for opencl programs on cpu/gpu heterogeneous platforms. In: HiPC (2014) – reference: NVIDIA CUDA Toolkit.: https://developer.nvidia.com/cuda-toolkit (2020) – reference: Zhao, J., et al.: Predicting cross-core performance interference on multicore processors with regression analysis. IEEE TPDS (2016) – reference: Muralidharan, S., et al.: Architecture-adaptive code variant tuning. 
In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS (2016) – reference: Ragan-Kelley, J., et al.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI (2013) – reference: Boyer, M., et al.: Improving GPU performance prediction with data transfer modeling. In: 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (2013) – reference: Sidelnik, A., et al.: Performance portability with the chapel language. In: 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS, pp. 582–594 (2012) – reference: Top500 Supercomputers.: https://www.top500.org/ (2020) – reference: Yan, Y., et al.: Supporting multiple accelerators in high-level programming models. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM@PPoPP (2015) – reference: Patterson, D.A.: 50 years of computer architecture: From the mainframe CPU to the domain-specific tpu and the open RISC-V instruction set. In: 2018 IEEE International Solid-State Circuits Conference, ISSCC (2018) – reference: OpenCL.: The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/ (2020) – reference: Haidl, M., Gorlatch, S.: PACXX: towards a unified programming model for programming accelerators using C++14. In: Proceedings of the 2014 LLVM Compiler Infrastructure in HPC, LLVM (2014) – reference: Taylor, B., et al.: Adaptive optimization for opencl programs on embedded heterogeneous systems. In: LCTES (2017) – reference: De SensiDBringing parallel patterns out of the corner: the p3 arsec benchmark suiteACM Trans. Archit. Code Optim. 
(TACO)20171412610.1145/3132710 – reference: Lepley, T., et al.: A novel compilation approach for image processing graphs on a many-core platform with explicitly managed memory. In: International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES (2013) – reference: GschwindMSynergistic processing in cell’s multicore architectureIEEE Micro200626102410.1109/MM.2006.41 – reference: Ravi, N., et al.: Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors. In: International Conference on Supercomputing, ICS (2012) – reference: Ren, J., et al.: Camel: Smart, adaptive energy optimization for mobile web interactions. In: IEEE Conference on Computer Communications (INFOCOM) (2020) – reference: Alfieri, R.A.: An efficient kernel-based implementation of POSIX threads. In: USENIX Summer 1994 Technical Conference. USENIX Association (1994) – reference: The OpenMP API specification for parallel programming.: https://www.openmp.org/ (2020) – reference: Common-Shader Core.: https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-common-core?redirectedfrom=MSDN (2018) – reference: Introducing rdna architecture.: Tech. rep., AMD Corporation (2019) – reference: Cummins, C., et al.: Synthesizing benchmarks for predictive modeling. In: CGO (2017) – reference: Marqués, R., et al.: Algorithmic skeleton framework for the orchestration of GPU computations. In: Euro-Par 2013 Parallel Processing, Lecture Notes in Computer Science (2013) – reference: Grewe, D., et al.: Portable mapping of data parallel programs to opencl for heterogeneous systems. In: CGO (2013b) – reference: Heller, T., et al.: Using HPX and libgeodecomp for scaling HPC applications on heterogeneous supercomputers. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA (2013) – reference: Ueng, S., et al.: Cuda-lite: Reducing GPU programming complexity. In: J.N. Amaral (ed.) 
Languages and Compilers for Parallel Computing, 21th International Workshop, LCPC (2008) – reference: Kim, J., et al.: Translating openmp device constructs to opencl using unnecessary data transfer elimination. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2016) – reference: Mishra, A., et al.: Kernel fusion/decomposition for automatic gpu-offloading. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019) – reference: Kim, Y., et al.: Translating CUDA to opencl for hardware generation using neural machine translation. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019) – reference: Unat, D., et al.: Mint: realizing CUDA performance in 3d stencil methods with annotated C. In: Proceedings of the 25th International Conference on Supercomputing, (2011) – reference: Copik, M., Kaiser, H.: Using SYCL as an implementation framework for hpx.compute. In: Proceedings of the 5th International Workshop on OpenCL, IWOCL (2017) – reference: Nvidia tesla v100 gpu architecture.: Tech. rep., NVIDIA Corporation (2017) – reference: FreeOCL.: http://www.zuzuf.net/FreeOCL/ (2020) – reference: Wang, Z., O’Boyle, M.F.: Partitioning streaming parallelism for multi-cores: a machine learning based approach. In: PACT (2010) – reference: Li, Z., et al.: Streaming applications on heterogeneous platforms. In: Network and Parallel Computing—13th IFIP WG 10.3 International Conference, NPC (2016b) – reference: KimWVossMMulticore desktop programming with intel threading building blocksIEEE Softw.201128233110.1109/MS.2011.12 – reference: Sanz MarcoVOptimizing deep learning inference on embedded systems through adaptive model selectionACM Trans. Embed. 
Comput.201919128 – reference: TomovSTowards dense linear algebra for hybrid GPU accelerated manycore systemsParallel Comput.201036232240276259010.1016/j.parco.2009.12.005 – reference: BaeHThe cetus source-to-source compiler infrastructure: overview and evaluationInt. J. Parallel Program.20134175376710.1007/s10766-012-0211-z – reference: Andrade, G., et al.: Parallelme: A parallel mobile engine to explore heterogeneity in mobile computing architectures. In: Euro-Par 2016: Parallel Processing—22nd International Conference on Parallel and Distributed Computing (2016) – reference: Chandrasekhar, A., et al.: IGC: the open source intel graphics compiler. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019) – reference: Gómez-LunaJPerformance models for asynchronous data transfers on consumer graphics processing unitsJ. Parallel Distrib. Comput.2012721117112610.1016/j.jpdc.2011.07.011 – reference: ScarpazzaDPEfficient breadth-first search on the cell/be processorIEEE Trans. Parallel Distrib. Syst.2008191381139510.1109/TPDS.2007.70811 – reference: Fang, J., et al.: Implementing and evaluating opencl on an armv8 multi-core CPU. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC) (2017) – reference: ROCm.: A New Era in Open GPU Computing. https://www.amd.com/en/graphics/servers-solutions-rocm-hpc (2020) – reference: Emani, M.K., et al.: Smart, adaptive mapping of parallelism in the presence of external workload. In: CGO (2013) – reference: Hong, S., et al.: Green-marl: a DSL for easy and efficient graph analysis. In: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS (2012) – reference: Wang, Z., et al.: Integrating profile-driven parallelism detection and machine-learning-based mapping. 
ACM TACO (2014b) – reference: Bader, D.A., Agarwal, V.: FFTC: fastest Fourier transform for the IBM cell broadband engine. In: High Performance Computing, HiPC (2007) – reference: Beignet OpenCL.: https://www.freedesktop.org/wiki/ Software/Beignet/ (2020) – reference: Kistler, M., et al.: Petascale computing with accelerators. In: D.A. Reed, V. Sarkar (eds.) Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2009) – reference: Diamos, C., et al.: Compiling a high-level language for GPUs: (via language support for architectures and compilers). In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI (2012) – reference: de Fine Licht, J., Hoefler, T.: hlslib: Software engineering for hardware design. CoRR (2019) – reference: HLSL.: The High Level Shading Language for DirectX. https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl (2018) – reference: Steinkrau, D., et al.: Using GPUs for machine learning algorithms. In: Eighth International Conference on Document Analysis and Recognition (ICDAR. IEEE Computer Society (2005) – reference: Fang, J., et al.: Test-driving intel xeon phi. In: ACM/SPEC International Conference on Performance Engineering (ICPE), pp. 137–148 (2014) – reference: KahleJAIntroduction to the cell multiprocessorIBM J. Res. Dev.20054958960410.1147/rd.494.0589 – reference: Steuwer, M., et al.: Skelcl—a portable skeleton library for high-level GPU programming. In: 25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS (2011) – reference: Ayguadé, E., et al.: An extension of the starss programming model for platforms with multiple GPUs. In: Euro-Par 2009 Parallel Processing (2009) – reference: Rudy, G., et al.: A programming language interface to describe transformations and code generation. 
In: Languages and Compilers for Parallel Computing - 23rd International Workshop, LCPC (2010) – reference: Cole, M.I.: Algorithmic skeletons: structured management of parallel computation (1989) – reference: Nvidia geforce gtx 980.: Tech. rep., NVIDIA Corporation (2014) – reference: Ragan-Kelley, J., et al.: Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph (2012) – reference: Kim, J., et al.: Snucl: an opencl framework for heterogeneous CPU/GPU clusters. In: International Conference on Supercomputing, ICS (2012) – reference: Parallel Patterns Library.: https://docs.microsoft.com/en-us/cpp/parallel/concrt/parallel-patterns-library-ppl?view=vs-2019 (2016) – reference: Pham, D., et al.: The design methodology and implementation of a first-generation CELL processor: a multi-core soc. In: Proceedings of the IEEE 2005 Custom Integrated Circuits Conference, CICC (2005) – reference: TournavitisGTowards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mappingACM Sigplan Not.20094417718710.1145/1543135.1542496 – reference: Fang, J., et al.: A comprehensive performance comparison of CUDA and opencl. In: ICPP (2011) – reference: Amini, M., et al.: Static compilation analysis for host-accelerator communication optimization. In: Languages and Compilers for Parallel Computing, 24th International Workshop, LCPC (2011) – reference: Heller, T., et al.: Closing the performance gap with modern C++. 
In: High Performance Computing - ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P∧\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\wedge$$\end{document}3MA, VHPC, WOPSSS (2016) – reference: Grewe, D., et al.: Opencl task partitioning in the presence of GPU contention. In: LCPC (2013a) – reference: JääskeläinenPpocl: a performance-portable opencl implementationInt. J. Parallel Programm.20154375278510.1007/s10766-014-0320-y – reference: Wang, Z., O’Boyle, M.F.: Using machine learning to partition streaming programs. ACM TACO (2013) – reference: Wang, Z.: Machine learning based mapping of data and streaming parallelism to multi-cores. Ph.D. thesis, University of Edinburgh (2011) – reference: Zenker, E., et al.: Alpaka—an abstraction library for parallel kernel acceleration. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops (2016) – reference: Lee, S., Eigenmann, R.: Openmpc: Extended openmp programming and tuning for GPUs. In: Conference on High Performance Computing Networking, Storage and Analysis, SC (2010) – reference: YuanLUsing machine learning to optimize web interactions on heterogeneous mobile systemsIEEE Access2019713939413940810.1109/ACCESS.2019.2936620 – reference: Amd brook+ programming.: Tech. rep., AMD Corporation (2007) – reference: HCC.: Heterogeneous Compute Compiler. https://gpuopen.com/compute-product/hcc-heterogeneous-compute-compiler/ (2020) – reference: The Frontier Supercomputer.: https://www.olcf.ornl.gov/frontier/ (2020) – reference: Crawford, C.H., et al.: Accelerating computing with the cell broadband engine processor. In: Proceedings of the 5th Conference on Computing Frontiers (2008) – reference: Nvidia tesla p100.: Tech. 
rep., NVIDIA Corporation (2016) – reference: Heler, T., et al.: Hpx—an open source c++ standard library for parallelism and concurrency. In: OpenSuCo (2017) – reference: Directcompute programming guide.: Tech. rep., NVIDIA Corporation (2010) – reference: Wong, H., et al.: Demystifying GPU microarchitecture through microbenchmarking. In: IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS (2010) – reference: Marco, V.S., et al.: Improving spark application throughput via memory aware task co-location: a mixture of experts approach. In: Middleware (2017) – reference: Merrill, D., et al.: Scalable GPU graph traversal. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2012) – reference: Ren, J., et al.: Optimise web browsing on heterogeneous mobile platforms: a machine learning based approach. In: INFOCOM (2017) – reference: Komornicki, A., et al.: Roadrunner: hardware and software overview (2009) – reference: ViñasMHeterogeneous distributed computing based on high-level abstractionsPract. Exp. Concurr. Comput.201820e466410.1002/cpe.4664 – reference: DuranAOmpss: a proposal for programming heterogeneous multi-core architecturesParallel Process. Lett.201121173193281200010.1142/S0129626411000151 – reference: Liu, B., et al.: Software pipelining for graphic processing unit acceleration: partition, scheduling and granularity. IJHPCA (2016) – reference: Fang, J.: Towards a systematic exploration of the optimization space for many-core processors. Ph.D. thesis, Delft University of Technology, Netherlands (2014) – reference: Gregory, K., Miller, A.: C++ AMP: accelerated massive parallelism with microsoft visual C++ (2012) – reference: Han, T.D., et al.: hicuda: a high-level directive-based language for GPU programming. 
In: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU, ACM International Conference Proceeding Series (2009) – reference: Zhang, P., et al.: Optimizing streaming parallelism on heterogeneous many-core architectures. IEEE TPDS (2020) – reference: GalliumCompute.: https://github.com/intel/compute-runtime (2020) – reference: Che, S., et al.: Rodinia: A benchmark suite for heterogeneous computing. In: Proceedings of the 2009 IEEE International Symposium on Workload Characterization. IEEE Computer Society (2009) – reference: AMD’s OpenCL Implementation.: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime (2020) – reference: Kudlur, M., et al.: Orchestrating the execution of stream programs on multicore platforms. In: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, PLDI (2008) – reference: Renderscript Compute.: http://developer.android.com/guide/topics/renderscript/compute.html (2020) – reference: Intel Manycore Platform Software Stack.: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss (2020) – ident: 39_CR109 doi: 10.1109/CASES.2013.6662510 – ident: 39_CR60 – ident: 39_CR88 doi: 10.1145/2150976.2151013 – ident: 39_CR174 – ident: 39_CR31 – ident: 39_CR8 doi: 10.1007/978-3-319-43659-3_33 – ident: 39_CR77 – ident: 39_CR121 doi: 10.1109/IPDPS.2012.59 – ident: 39_CR13 doi: 10.1109/JPROC.2018.2841200 – ident: 39_CR21 doi: 10.1109/IPDPSW.2013.236 – ident: 39_CR33 doi: 10.1145/3078155.3078187 – volume: 46 start-page: 23 year: 2018 ident: 39_CR74 publication-title: Int. J. Parallel Program. 
doi: 10.1007/s10766-017-0497-y – ident: 39_CR130 doi: 10.1145/2159430.2159431 – ident: 39_CR164 doi: 10.1109/IPDPS.2012.60 – ident: 39_CR156 – ident: 39_CR187 doi: 10.1145/2400682.2400713 – ident: 39_CR15 doi: 10.1007/978-3-642-11970-5_14 – ident: 39_CR104 doi: 10.1145/1201775.882363 – ident: 39_CR190 doi: 10.1145/2677036 – ident: 39_CR42 – ident: 39_CR111 doi: 10.1109/IPDPSW.2016.99 – ident: 39_CR62 doi: 10.1145/1964218.1964221 – ident: 39_CR4 – ident: 39_CR99 doi: 10.1109/SC.2016.50 – ident: 39_CR10 doi: 10.1007/978-3-642-03869-3_79 – ident: 39_CR139 doi: 10.1007/978-3-319-17473-0_10 – volume: 20 start-page: e4664 year: 2018 ident: 39_CR189 publication-title: Pract. Exp. Concurr. Comput. doi: 10.1002/cpe.4664 – ident: 39_CR66 – volume: 19 start-page: 1381 year: 2008 ident: 39_CR162 publication-title: IEEE Trans. Parallel Distrib. Syst. doi: 10.1109/TPDS.2007.70811 – volume: 69 start-page: 23 year: 2014 ident: 39_CR167 publication-title: J. Supercomput. doi: 10.1007/s11227-014-1213-y – ident: 39_CR128 doi: 10.1109/IPDPSW.2016.217 – ident: 39_CR9 – volume: 27 start-page: 1 year: 2009 ident: 39_CR163 publication-title: IEEE Micro – ident: 39_CR22 doi: 10.1109/IPDPSW.2010.5470823 – ident: 39_CR89 – ident: 39_CR144 – ident: 39_CR185 doi: 10.1145/1995896.1995932 – ident: 39_CR196 doi: 10.1145/1854273.1854313 – ident: 39_CR3 – ident: 39_CR175 – ident: 39_CR155 doi: 10.1145/3281411.3281422 – ident: 39_CR105 doi: 10.1145/1375581.1375596 – ident: 39_CR133 – ident: 39_CR199 doi: 10.1109/ISPASS.2010.5452013 – ident: 39_CR30 – ident: 39_CR166 doi: 10.1109/IPDPS.2011.269 – ident: 39_CR181 – ident: 39_CR38 doi: 10.1007/978-3-642-45293-2_13 – volume: 36 start-page: 232 year: 2010 ident: 39_CR180 publication-title: Parallel Comput. doi: 10.1016/j.parco.2009.12.005 – ident: 39_CR34 doi: 10.1145/1366230.1366234 – volume: 26 start-page: 1640002 issue: 4 year: 2016 ident: 39_CR57 publication-title: Parallel Process. Lett. 
doi: 10.1142/S0129626416400028 – ident: 39_CR178 – start-page: 359 volume-title: GPU Computing Gems Jade Edition, Applications of GPU Computing Series year: 2012 ident: 39_CR18 doi: 10.1016/B978-0-12-385963-1.00026-5 – ident: 39_CR58 – ident: 39_CR82 doi: 10.1007/978-3-319-46079-6_2 – ident: 39_CR80 doi: 10.1109/PACT.2011.60 – ident: 39_CR115 doi: 10.1177/1094342015585845 – ident: 39_CR70 doi: 10.1109/CGO.2013.6494993 – ident: 39_CR118 doi: 10.1007/978-3-642-40047-6_86 – ident: 39_CR124 doi: 10.1109/IPDPSW.2012.226 – ident: 39_CR146 – ident: 39_CR169 – ident: 39_CR35 doi: 10.1109/PACT.2017.24 – ident: 39_CR127 doi: 10.1145/2872362.2872411 – ident: 39_CR5 – ident: 39_CR192 doi: 10.1145/2579561 – ident: 39_CR195 doi: 10.1145/2512436 – ident: 39_CR32 – ident: 39_CR44 doi: 10.5281/zenodo.571466 – ident: 39_CR202 doi: 10.1145/2688500.2688505 – ident: 39_CR135 – ident: 39_CR173 – ident: 39_CR153 doi: 10.1109/INFOCOM41043.2020.9155489 – ident: 39_CR64 doi: 10.1109/SC.2008.5213922 – volume: 28 start-page: 39 year: 2008 ident: 39_CR114 publication-title: IEEE Micro doi: 10.1109/MM.2008.31 – ident: 39_CR101 doi: 10.1145/1504176.1504212 – volume: 48 start-page: 80 year: 2020 ident: 39_CR29 publication-title: Int. J. Parallel Program. doi: 10.1007/s10766-019-00646-x – ident: 39_CR198 doi: 10.1145/1128022.1128027 – ident: 39_CR134 – ident: 39_CR165 doi: 10.1109/ICDAR.2005.251 – ident: 39_CR157 – ident: 39_CR112 doi: 10.1007/978-3-319-47099-3_10 – ident: 39_CR102 doi: 10.1109/ICPP.2013.35 – ident: 39_CR116 doi: 10.1145/3135974.3135984 – ident: 39_CR97 doi: 10.1145/2304576.2304623 – ident: 39_CR20 – ident: 39_CR140 – ident: 39_CR179 – ident: 39_CR59 – ident: 39_CR48 – volume: 72 start-page: 1117 year: 2012 ident: 39_CR63 publication-title: J. Parallel Distrib. Comput. doi: 10.1016/j.jpdc.2011.07.011 – ident: 39_CR69 doi: 10.1007/978-3-319-09967-5_5 – volume: 19 start-page: 1 year: 2019 ident: 39_CR160 publication-title: ACM Trans. Embed. Comput. 
StartPage	382
SubjectTerms	Algorithms Assembly language Communication Computation Computer Hardware Computer Science Computer Systems Organization and Communication Networks Linear algebra Machine learning Parallel programming Programmers Software Supercomputers Survey Paper
Title	Parallel programming models for heterogeneous many-cores: a comprehensive survey
URI	https://link.springer.com/article/10.1007/s42514-020-00039-4 https://www.proquest.com/docview/2938245920
Volume	2