Parallel programming models for heterogeneous many-cores: a comprehensive survey

Heterogeneous many-cores are now an integral part of modern computing systems, ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high performance, that potential can only be unlocked if application programs are suitabl...

Bibliographic details
Published in: CCF Transactions on High Performance Computing (Online), Vol. 2, No. 4, pp. 382-400
Main authors: Fang, Jianbin; Huang, Chun; Tang, Tao; Wang, Zheng
Format: Journal Article
Language: English
Publication details: Singapore: Springer Singapore, 1 December 2020
Springer Nature B.V
ISSN: 2524-4922, 2524-4930
Abstract Heterogeneous many-cores are now an integral part of modern computing systems, ranging from embedded systems to supercomputers. While heterogeneous many-core design offers the potential for energy-efficient high performance, that potential can only be unlocked if application programs are suitably parallel and can be made to match the underlying heterogeneous platform. In this article, we provide a comprehensive survey of parallel programming models for heterogeneous many-core architectures and review the compilation techniques for improving programmability and portability. We examine various software optimization techniques for minimizing the communication overhead between heterogeneous computing devices. We provide a road map for a wide variety of research areas. We conclude with a discussion of open issues in the area and potential research directions. This article provides both an accessible introduction to the fast-moving area of heterogeneous programming and a detailed bibliography of its main achievements.
Author Fang, Jianbin
Tang, Tao
Huang, Chun
Wang, Zheng
Author_xml – sequence: 1
  givenname: Jianbin
  surname: Fang
  fullname: Fang, Jianbin
  organization: Institute for Computer Systems, College of Computer, National University of Defense Technology
– sequence: 2
  givenname: Chun
  surname: Huang
  fullname: Huang, Chun
  email: chunhuang@nudt.edu.cn
  organization: Institute for Computer Systems, College of Computer, National University of Defense Technology
– sequence: 3
  givenname: Tao
  surname: Tang
  fullname: Tang, Tao
  organization: Institute for Computer Systems, College of Computer, National University of Defense Technology
– sequence: 4
  givenname: Zheng
  surname: Wang
  fullname: Wang, Zheng
  organization: School of Computing, University of Leeds
ContentType Journal Article
Copyright China Computer Federation (CCF) 2020
DOI 10.1007/s42514-020-00039-4
Discipline Computer Science
EISSN 2524-4930
EndPage 400
ISICitedReferencesCount 36
ISSN 2524-4922
IsPeerReviewed true
IsScholarly true
Issue 4
Keywords Heterogeneous computing
Parallel programming models
Many-core architectures
Language English
PageCount 19
PublicationCentury 2000
PublicationDate 20201200
2020-12-00
20201201
PublicationDateYYYYMMDD 2020-12-01
PublicationDate_xml – month: 12
  year: 2020
  text: 20201200
PublicationDecade 2020
PublicationPlace Singapore
PublicationPlace_xml – name: Singapore
– name: Beijing
PublicationTitle CCF transactions on high performance computing (Online)
PublicationTitleAbbrev CCF Trans. HPC
PublicationYear 2020
Publisher Springer Singapore
Springer Nature B.V
Publisher_xml – name: Springer Singapore
– name: Springer Nature B.V
– reference: PIPS.: Automatic Parallelizer and Code Transformation Framework. https://pips4u.org/ (2020)
– reference: Owens, J.D., et al.: GPU computing. Proceedings of the IEEE (2008)
– reference: Baskaran, M.M., et al.: Automatic c-to-cuda code generation for affine programs. In: R. Gupta (ed.) 19th International Conference on Compiler Construction (CC) (2010)
– reference: Krüger, J.H., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. ACM Trans. Graph (2003)
– reference: Martinez, G., et al.: CU2CL: A cuda-to-opencl translator for multi- and many-core architectures. In: 17th IEEE International Conference on Parallel and Distributed Systems, ICPADS (2011)
– reference: Hong, S., et al.: Accelerating CUDA graph algorithms at maximum warp. In: Proceedings of the 16th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2011)
– reference: Williams, S., et al.: The potential of the cell processor for scientific computing. In: Proceedings of the Third Conference on Computing Frontiers (2006)
– reference: van Werkhoven, B., et al.: Performance models for CPU-GPU data transfers. In: 14th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2014)
– reference: Li, Z., et al.: Evaluating the performance impact of multiple streams on the mic-based heterogeneous platform. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops (2016a)
– reference: DavisNEParadigmatic shifts for exascale supercomputingJ. Supercomput.2012621023104410.1007/s11227-012-0789-3
– reference: GalliumCompute.: https://dri.freedesktop.org/wiki /GalliumCompute/ (2020)
– reference: ROCm Runtime.: https://github.com/RadeonOpenCompute /ROCR-Runtime (2020)
– reference: Nugteren, C., Corporaal, H.: Introducing ’bones’: a parallelizing source-to-source compiler based on algorithmic skeletons. In: The 5th Annual Workshop on General Purpose Processing with Graphics Processing Units, GPGPU (2012)
– reference: Trevett, N.: Opencl, sycl and spir—the next steps. Tech. rep, OpenCL Working Group (2019)
– reference: LindholmENVIDIA tesla: a unified graphics and computing architectureIEEE Micro200828395510.1109/MM.2008.31
– reference: Amd cal programming guide v2.0.: Tech. rep., AMD Corporation (2010)
– reference: Giles, M.B., et al.: Performance analysis of the OP2 framework on many-core architectures. SIGMETRICS Performance Evaluation Review (2011)
– reference: High-level abstractions for performance.: Portability and continuity of scientific software on future computing systems. University of Oxford, Tech. rep. (2014)
– reference: GardnerMKCharacterizing the challenges and evaluating the efficacy of a cuda-to-opencl translatorParallel Comput.20133976978610.1016/j.parco.2013.09.003
– reference: Intel Inc.: hStreams Architecture for MPSS 3.5 (2015)
– reference: ChenTCell broadband engine architecture and its first implementation—a performance viewIBM J. Res. Dev.20075155957210.1147/rd.515.0559
– reference: PGI CUDA C/C++ for x86.: https://developer.nvidia.com/pgi-cuda-cc-x86 (2020)
– reference: Wang, Z., et al.: Automatic and portable mapping of data parallel programs to opencl for gpu-based heterogeneous systems. ACM TACO (2015)
– reference: Balaprakash, P., et al.: Autotuning in high-performance computing applications. In: Proceedings of the IEEE (2018)
– reference: Cummins, C., et al.: End-to-end deep learning of optimization heuristics. In: PACT (2017)
– reference: ErnstssonASkepu 2: flexible and type-safe skeleton programming for heterogeneous parallel systemsInt. J. Parallel Program.201846628010.1007/s10766-017-0490-5
– reference: HaidlMGorlatchSHigh-level programming for many-cores using C++14 and the STLInt. J. Parallel Program.201846234110.1007/s10766-017-0497-y
– reference: ChenDCharacterizing scalability of sparse matrix-vector multiplications on phytium ft-2000+Int. J. Parallel Program.202048809710.1007/s10766-019-00646-x
– reference: Yang, C., et al.: O2render: An opencl-to-renderscript translator for porting across various GPUs or CPUs. In: IEEE 10th Symposium on Embedded Systems for Real-time Multimedia, ESTIMedia (2012)
– reference: de Carvalho Moreira, W., et al.: Exploring heterogeneous mobile architectures with a high-level programming model. In: 29th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD (2017)
– reference: Barker, K.J., et al.: Entering the petaflop era: the architecture and performance of roadrunner. In: Proceedings of the ACM/IEEE Conference on High Performance Computing, SC (2008)
– reference: Chen, T., et al.: Mxnet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR abs/1512.01274 (2015)
– reference: Mark, W.R., et al.: Cg: a system for programming graphics hardware in a c-like language. ACM Trans. Graph (2003)
– reference: Ogilvie, W.F., et al.: Fast automatic heuristic construction using active learning. In: LCPC (2014)
– reference: Diamos, G.F., et al.: Ocelot: a dynamic optimization framework for bulk-synchronous applications in heterogeneous systems. In: 19th International Conference on Parallel Architectures and Compilation Techniques, PACT (2010)
– reference: Bellens, P., et al.: Cellss: a programming model for the cell BE architecture. In: Proceedings of the ACM/IEEE SC2006 Conference on High Performance Networking and Computing (2006)
– reference: Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32: Annual Conference on Neural Information Processing Systems 2019, NeurIPS, pp. 8024–8035 (2019)
– reference: O’BrienKSupporting openmp on cellInt. J. Parallel Program.20083628931110.1007/s10766-008-0072-7
– reference: ViñasMExploiting heterogeneous parallelism with the heterogeneous programming libraryJ. Parallel Distrib. Comput.2013731627163810.1016/j.jpdc.2013.07.013
– reference: The OpenACC API specification for parallel programming.: https://www.openacc.org/ (2020)
– reference: Govindaraju, N.K., et al.: High performance discrete fourier transforms on graphics processors. In: Proceedings of the ACM/IEEE Conference on High Performance Computing, SC (2008)
– reference: KarpRMThe organization of computations for uniform recurrence equationsJ. ACM (JACM)19671456359023460410.1145/321406.321418
– reference: Nvidia’s next generation cuda compute architecture.: Fermi. NVIDIA Corporation, Tech. rep. (2009)
– reference: Beckingsale, D., et al.: Performance portable C++ programming with RAJA. In: Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel ProgrammiDng, PPoPP (2019)
– reference: The OpenCL Conformance Tests.: https://github.com/KhronosGroup/OpenCL-CTS (2020)
– reference: Sycl integrates opencl devices with modern c++.: Tech. Rep. version 1.2.1 revison 6, The Khronos Group (2019)
– reference: Arevalo, A., et al.: Programming the cell broadband engine: examples and best practices (2007)
– reference: FangJEvaluating multiple streams on heterogeneous platformsParallel Process. Lett.20162641640002358509810.1142/S0129626416400028
– reference: BuckIBrook for GPUs: stream computing on graphics hardwareACM Trans. Graph20042377778610.1145/1015706.1015800
– reference: Intel’s OneAPI.: https://software.intel.com/en-us/oneapi (2020)
– reference: Szuppe, J.: Boost.compute: A parallel computing library for C++ based on opencl. In: Proceedings of the 4th International Workshop on OpenCL, IWOCL (2016)
– reference: HarveyMJSwan: a tool for porting CUDA programs to openclComput. Phys. Commun.20111821093109910.1016/j.cpc.2010.12.052
– reference: Ren, J., et al.: Proteus: Network-aware web browsing on heterogeneous mobile systems. In: CoNEXT ’18 (2018)
– reference: Wen, Y., et al.: Smart multi-task scheduling for opencl programs on cpu/gpu heterogeneous platforms. In: HiPC (2014)
– reference: NVIDIA CUDA Toolkit.: https://developer.nvidia.com/cuda-toolkit (2020)
– reference: Zhao, J., et al.: Predicting cross-core performance interference on multicore processors with regression analysis. IEEE TPDS (2016)
– reference: Muralidharan, S., et al.: Architecture-adaptive code variant tuning. In: Proceedings of the Twenty-First International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS (2016)
– reference: Ragan-Kelley, J., et al.: Halide: a language and compiler for optimizing parallelism, locality, and recomputation in image processing pipelines. In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI (2013)
– reference: Boyer, M., et al.: Improving GPU performance prediction with data transfer modeling. In: 2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (2013)
– reference: Sidelnik, A., et al.: Performance portability with the chapel language. In: 26th IEEE International Parallel and Distributed Processing Symposium, IPDPS, pp. 582–594 (2012)
– reference: Top500 Supercomputers.: https://www.top500.org/ (2020)
– reference: Yan, Y., et al.: Supporting multiple accelerators in high-level programming models. In: Proceedings of the Sixth International Workshop on Programming Models and Applications for Multicores and Manycores, PMAM@PPoPP (2015)
– reference: Patterson, D.A.: 50 years of computer architecture: From the mainframe CPU to the domain-specific tpu and the open RISC-V instruction set. In: 2018 IEEE International Solid-State Circuits Conference, ISSCC (2018)
– reference: OpenCL.: The open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl/ (2020)
– reference: Haidl, M., Gorlatch, S.: PACXX: towards a unified programming model for programming accelerators using C++14. In: Proceedings of the 2014 LLVM Compiler Infrastructure in HPC, LLVM (2014)
– reference: Taylor, B., et al.: Adaptive optimization for opencl programs on embedded heterogeneous systems. In: LCTES (2017)
– reference: De SensiDBringing parallel patterns out of the corner: the p3 arsec benchmark suiteACM Trans. Archit. Code Optim. (TACO)20171412610.1145/3132710
– reference: Lepley, T., et al.: A novel compilation approach for image processing graphs on a many-core platform with explicitly managed memory. In: International Conference on Compilers, Architecture and Synthesis for Embedded Systems, CASES (2013)
– reference: GschwindMSynergistic processing in cell’s multicore architectureIEEE Micro200626102410.1109/MM.2006.41
– reference: Ravi, N., et al.: Apricot: an optimizing compiler and productivity tool for x86-compatible many-core coprocessors. In: International Conference on Supercomputing, ICS (2012)
– reference: Ren, J., et al.: Camel: Smart, adaptive energy optimization for mobile web interactions. In: IEEE Conference on Computer Communications (INFOCOM) (2020)
– reference: Alfieri, R.A.: An efficient kernel-based implementation of POSIX threads. In: USENIX Summer 1994 Technical Conference. USENIX Association (1994)
– reference: The OpenMP API specification for parallel programming.: https://www.openmp.org/ (2020)
– reference: Common-Shader Core.: https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl-common-core?redirectedfrom=MSDN (2018)
– reference: Introducing rdna architecture.: Tech. rep., AMD Corporation (2019)
– reference: Cummins, C., et al.: Synthesizing benchmarks for predictive modeling. In: CGO (2017)
– reference: Marqués, R., et al.: Algorithmic skeleton framework for the orchestration of GPU computations. In: Euro-Par 2013 Parallel Processing, Lecture Notes in Computer Science (2013)
– reference: Grewe, D., et al.: Portable mapping of data parallel programs to opencl for heterogeneous systems. In: CGO (2013b)
– reference: Heller, T., et al.: Using HPX and libgeodecomp for scaling HPC applications on heterogeneous supercomputers. In: Proceedings of the Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, ScalA (2013)
– reference: Ueng, S., et al.: Cuda-lite: Reducing GPU programming complexity. In: J.N. Amaral (ed.) Languages and Compilers for Parallel Computing, 21th International Workshop, LCPC (2008)
– reference: Kim, J., et al.: Translating openmp device constructs to opencl using unnecessary data transfer elimination. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC (2016)
– reference: Mishra, A., et al.: Kernel fusion/decomposition for automatic gpu-offloading. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019)
– reference: Kim, Y., et al.: Translating CUDA to opencl for hardware generation using neural machine translation. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019)
– reference: Unat, D., et al.: Mint: realizing CUDA performance in 3d stencil methods with annotated C. In: Proceedings of the 25th International Conference on Supercomputing, (2011)
– reference: Copik, M., Kaiser, H.: Using SYCL as an implementation framework for hpx.compute. In: Proceedings of the 5th International Workshop on OpenCL, IWOCL (2017)
– reference: Nvidia tesla v100 gpu architecture.: Tech. rep., NVIDIA Corporation (2017)
– reference: FreeOCL.: http://www.zuzuf.net/FreeOCL/ (2020)
– reference: Wang, Z., O’Boyle, M.F.: Partitioning streaming parallelism for multi-cores: a machine learning based approach. In: PACT (2010)
– reference: Li, Z., et al.: Streaming applications on heterogeneous platforms. In: Network and Parallel Computing—13th IFIP WG 10.3 International Conference, NPC (2016b)
– reference: KimWVossMMulticore desktop programming with intel threading building blocksIEEE Softw.201128233110.1109/MS.2011.12
– reference: Sanz MarcoVOptimizing deep learning inference on embedded systems through adaptive model selectionACM Trans. Embed. Comput.201919128
– reference: TomovSTowards dense linear algebra for hybrid GPU accelerated manycore systemsParallel Comput.201036232240276259010.1016/j.parco.2009.12.005
– reference: BaeHThe cetus source-to-source compiler infrastructure: overview and evaluationInt. J. Parallel Program.20134175376710.1007/s10766-012-0211-z
– reference: Andrade, G., et al.: Parallelme: A parallel mobile engine to explore heterogeneity in mobile computing architectures. In: Euro-Par 2016: Parallel Processing—22nd International Conference on Parallel and Distributed Computing (2016)
– reference: Chandrasekhar, A., et al.: IGC: the open source intel graphics compiler. In: IEEE/ACM International Symposium on Code Generation and Optimization, CGO (2019)
– reference: Gómez-LunaJPerformance models for asynchronous data transfers on consumer graphics processing unitsJ. Parallel Distrib. Comput.2012721117112610.1016/j.jpdc.2011.07.011
– reference: ScarpazzaDPEfficient breadth-first search on the cell/be processorIEEE Trans. Parallel Distrib. Syst.2008191381139510.1109/TPDS.2007.70811
– reference: Fang, J., et al.: Implementing and evaluating opencl on an armv8 multi-core CPU. In: 2017 IEEE International Symposium on Parallel and Distributed Processing with Applications and 2017 IEEE International Conference on Ubiquitous Computing and Communications (ISPA/IUCC) (2017)
– reference: ROCm.: A New Era in Open GPU Computing. https://www.amd.com/en/graphics/servers-solutions-rocm-hpc (2020)
– reference: Emani, M.K., et al.: Smart, adaptive mapping of parallelism in the presence of external workload. In: CGO (2013)
– reference: Hong, S., et al.: Green-marl: a DSL for easy and efficient graph analysis. In: Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS (2012)
– reference: Wang, Z., et al.: Integrating profile-driven parallelism detection and machine-learning-based mapping. ACM TACO (2014b)
– reference: Bader, D.A., Agarwal, V.: FFTC: fastest Fourier transform for the IBM cell broadband engine. In: High Performance Computing, HiPC (2007)
– reference: Beignet OpenCL.: https://www.freedesktop.org/wiki/ Software/Beignet/ (2020)
– reference: Kistler, M., et al.: Petascale computing with accelerators. In: D.A. Reed, V. Sarkar (eds.) Proceedings of the 14th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2009)
– reference: Diamos, C., et al.: Compiling a high-level language for GPUs: (via language support for architectures and compilers). In: ACM SIGPLAN Conference on Programming Language Design and Implementation, PLDI (2012)
– reference: de Fine Licht, J., Hoefler, T.: hlslib: Software engineering for hardware design. CoRR (2019)
– reference: HLSL.: The High Level Shading Language for DirectX. https://docs.microsoft.com/en-us/windows/win32/direct3dhlsl/dx-graphics-hlsl (2018)
– reference: Steinkrau, D., et al.: Using GPUs for machine learning algorithms. In: Eighth International Conference on Document Analysis and Recognition (ICDAR. IEEE Computer Society (2005)
– reference: Fang, J., et al.: Test-driving intel xeon phi. In: ACM/SPEC International Conference on Performance Engineering (ICPE), pp. 137–148 (2014)
– reference: KahleJAIntroduction to the cell multiprocessorIBM J. Res. Dev.20054958960410.1147/rd.494.0589
– reference: Steuwer, M., et al.: Skelcl—a portable skeleton library for high-level GPU programming. In: 25th IEEE International Symposium on Parallel and Distributed Processing, IPDPS (2011)
– reference: Ayguadé, E., et al.: An extension of the starss programming model for platforms with multiple GPUs. In: Euro-Par 2009 Parallel Processing (2009)
– reference: Rudy, G., et al.: A programming language interface to describe transformations and code generation. In: Languages and Compilers for Parallel Computing - 23rd International Workshop, LCPC (2010)
– reference: Cole, M.I.: Algorithmic skeletons: structured management of parallel computation (1989)
– reference: Nvidia geforce gtx 980.: Tech. rep., NVIDIA Corporation (2014)
– reference: Ragan-Kelley, J., et al.: Decoupling algorithms from schedules for easy optimization of image processing pipelines. ACM Trans. Graph (2012)
– reference: Kim, J., et al.: Snucl: an opencl framework for heterogeneous CPU/GPU clusters. In: International Conference on Supercomputing, ICS (2012)
– reference: Parallel Patterns Library.: https://docs.microsoft.com/en-us/cpp/parallel/concrt/parallel-patterns-library-ppl?view=vs-2019 (2016)
– reference: Pham, D., et al.: The design methodology and implementation of a first-generation CELL processor: a multi-core soc. In: Proceedings of the IEEE 2005 Custom Integrated Circuits Conference, CICC (2005)
– reference: TournavitisGTowards a holistic approach to auto-parallelization: integrating profile-driven parallelism detection and machine-learning based mappingACM Sigplan Not.20094417718710.1145/1543135.1542496
– reference: Fang, J., et al.: A comprehensive performance comparison of CUDA and opencl. In: ICPP (2011)
– reference: Amini, M., et al.: Static compilation analysis for host-accelerator communication optimization. In: Languages and Compilers for Parallel Computing, 24th International Workshop, LCPC (2011)
– reference: Heller, T., et al.: Closing the performance gap with modern C++. In: High Performance Computing - ISC High Performance 2016 International Workshops, ExaComm, E-MuCoCoS, HPC-IODC, IXPUG, IWOPH, P∧\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{mathrsfs} \usepackage{upgreek} \setlength{\oddsidemargin}{-69pt} \begin{document}$$\wedge$$\end{document}3MA, VHPC, WOPSSS (2016)
– reference: Grewe, D., et al.: Opencl task partitioning in the presence of GPU contention. In: LCPC (2013a)
– reference: JääskeläinenPpocl: a performance-portable opencl implementationInt. J. Parallel Programm.20154375278510.1007/s10766-014-0320-y
– reference: Wang, Z., O’Boyle, M.F.: Using machine learning to partition streaming programs. ACM TACO (2013)
– reference: Wang, Z.: Machine learning based mapping of data and streaming parallelism to multi-cores. Ph.D. thesis, University of Edinburgh (2011)
– reference: Zenker, E., et al.: Alpaka—an abstraction library for parallel kernel acceleration. In: 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops (2016)
– reference: Lee, S., Eigenmann, R.: Openmpc: Extended openmp programming and tuning for GPUs. In: Conference on High Performance Computing Networking, Storage and Analysis, SC (2010)
– reference: YuanLUsing machine learning to optimize web interactions on heterogeneous mobile systemsIEEE Access2019713939413940810.1109/ACCESS.2019.2936620
– reference: Amd brook+ programming.: Tech. rep., AMD Corporation (2007)
– reference: HCC.: Heterogeneous Compute Compiler. https://gpuopen.com/compute-product/hcc-heterogeneous-compute-compiler/ (2020)
– reference: The Frontier Supercomputer.: https://www.olcf.ornl.gov/frontier/ (2020)
– reference: Crawford, C.H., et al.: Accelerating computing with the cell broadband engine processor. In: Proceedings of the 5th Conference on Computing Frontiers (2008)
– reference: Nvidia tesla p100.: Tech. rep., NVIDIA Corporation (2016)
– reference: Heler, T., et al.: Hpx—an open source c++ standard library for parallelism and concurrency. In: OpenSuCo (2017)
– reference: Directcompute programming guide.: Tech. rep., NVIDIA Corporation (2010)
– reference: Wong, H., et al.: Demystifying GPU microarchitecture through microbenchmarking. In: IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS (2010)
– reference: Marco, V.S., et al.: Improving spark application throughput via memory aware task co-location: a mixture of experts approach. In: Middleware (2017)
– reference: Merrill, D., et al.: Scalable GPU graph traversal. In: Proceedings of the 17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPOPP (2012)
– reference: Ren, J., et al.: Optimise web browsing on heterogeneous mobile platforms: a machine learning based approach. In: INFOCOM (2017)
– reference: Komornicki, A., et al.: Roadrunner: hardware and software overview (2009)
– reference: ViñasMHeterogeneous distributed computing based on high-level abstractionsPract. Exp. Concurr. Comput.201820e466410.1002/cpe.4664
– reference: DuranAOmpss: a proposal for programming heterogeneous multi-core architecturesParallel Process. Lett.201121173193281200010.1142/S0129626411000151
– reference: Liu, B., et al.: Software pipelining for graphic processing unit acceleration: partition, scheduling and granularity. IJHPCA (2016)
– reference: Fang, J.: Towards a systematic exploration of the optimization space for many-core processors. Ph.D. thesis, Delft University of Technology, Netherlands (2014)
– reference: Gregory, K., Miller, A.: C++ AMP: accelerated massive parallelism with microsoft visual C++ (2012)
– reference: Han, T.D., et al.: hicuda: a high-level directive-based language for GPU programming. In: Proceedings of 2nd Workshop on General Purpose Processing on Graphics Processing Units, GPGPU, ACM International Conference Proceeding Series (2009)
– reference: Zhang, P., et al.: Optimizing streaming parallelism on heterogeneous many-core architectures. IEEE TPDS (2020)
– reference: GalliumCompute.: https://github.com/intel/compute-runtime (2020)
– reference: Che, S., et al.: Rodinia: A benchmark suite for heterogeneous computing. In: Proceedings of the 2009 IEEE International Symposium on Workload Characterization. IEEE Computer Society (2009)
– reference: AMD’s OpenCL Implementation.: https://github.com/RadeonOpenCompute/ROCm-OpenCL-Runtime (2020)
– reference: Kudlur, M., et al.: Orchestrating the execution of stream programs on multicore platforms. In: Proceedings of the ACM SIGPLAN 2008 Conference on Programming Language Design and Implementation, PLDI (2008)
– reference: Renderscript Compute.: http://developer.android.com/guide/topics/renderscript/compute.html (2020)
– reference: Intel Manycore Platform Software Stack.: https://software.intel.com/en-us/articles/intel-manycore-platform-software-stack-mpss (2020)
– ident: 39_CR197
  doi: 10.1109/HiPC.2014.7116910
– volume: 29
  start-page: 283
  year: 2018
  ident: 39_CR37
  publication-title: IEEE Trans. Parallel Distrib. Syst.
  doi: 10.1109/TPDS.2017.2755657
– ident: 39_CR27
– volume: 49
  start-page: 589
  year: 2005
  ident: 39_CR94
  publication-title: IBM J. Res. Dev.
  doi: 10.1147/rd.494.0589
– ident: 39_CR108
  doi: 10.1145/1815961.1816021
– ident: 39_CR170
  doi: 10.1145/2909437.2909454
– ident: 39_CR147
– ident: 39_CR125
  doi: 10.1109/CGO.2019.8661188
– ident: 39_CR86
– ident: 39_CR103
– ident: 39_CR149
– ident: 39_CR201
  doi: 10.1109/ESTIMedia.2012.6507031
– ident: 39_CR51
  doi: 10.1109/CGO.2013.6495010
– ident: 39_CR47
  doi: 10.1145/1854273.1854318
– ident: 39_CR98
  doi: 10.1109/CGO.2019.8661172
– ident: 39_CR132
– ident: 39_CR65
  doi: 10.1109/IPDPS.2014.24
– ident: 39_CR110
  doi: 10.1145/1735688.1735698
– ident: 39_CR19
  doi: 10.1109/SC.2006.17
– ident: 39_CR55
  doi: 10.1145/2568088.2576799
– ident: 39_CR150
  doi: 10.1145/2185520.2185528
– volume: 44
  start-page: 177
  year: 2009
  ident: 39_CR182
  publication-title: ACM Sigplan Not.
  doi: 10.1145/1543135.1542496
– ident: 39_CR193
  doi: 10.1109/JPROC.2018.2817118
– ident: 39_CR209
– ident: 39_CR87
  doi: 10.1145/1941553.1941590
– volume: 62
  start-page: 1023
  year: 2012
  ident: 39_CR39
  publication-title: J. Supercomput.
  doi: 10.1007/s11227-012-0789-3
– ident: 39_CR90
– ident: 39_CR122
  doi: 10.1145/3084540
– ident: 39_CR46
– ident: 39_CR207
  doi: 10.1109/TPDS.2020.2978045
– ident: 39_CR96
  doi: 10.1145/2807591.2807621
– ident: 39_CR129
  doi: 10.1109/IPDPSW.2012.296
– volume: 28
  start-page: 23
  year: 2011
  ident: 39_CR100
  publication-title: IEEE Softw.
  doi: 10.1109/MS.2011.12
– ident: 39_CR16
  doi: 10.1145/3293883.3302577
– ident: 39_CR84
– ident: 39_CR85
– ident: 39_CR176
– ident: 39_CR206
  doi: 10.1145/3203217.3203244
– volume: 7
  start-page: 139394
  year: 2019
  ident: 39_CR203
  publication-title: IEEE Access
  doi: 10.1109/ACCESS.2019.2936620
– volume: 21
  start-page: 173
  year: 2011
  ident: 39_CR49
  publication-title: Parallel Process. Lett.
  doi: 10.1142/S0129626411000151
– ident: 39_CR79
– ident: 39_CR91
– ident: 39_CR56
– ident: 39_CR26
  doi: 10.1109/IISWC.2009.5306797
– ident: 39_CR36
  doi: 10.1109/CGO.2017.7863731
– volume: 51
  start-page: 559
  year: 2007
  ident: 39_CR28
  publication-title: IBM J. Res. Dev.
  doi: 10.1147/rd.515.0559
– ident: 39_CR119
  doi: 10.1109/ICPADS.2011.48
– volume: 26
  start-page: 10
  year: 2006
  ident: 39_CR71
  publication-title: IEEE Micro
  doi: 10.1109/MM.2006.41
– ident: 39_CR208
  doi: 10.1109/TPDS.2015.2442983
– ident: 39_CR148
– ident: 39_CR194
  doi: 10.1145/1854273.1854313
– ident: 39_CR161
  doi: 10.1145/3293320.3293338
– volume: 18
  start-page: 1
  year: 2010
  ident: 39_CR23
  publication-title: Sci. Program.
– volume: 22
  start-page: 78
  year: 2011
  ident: 39_CR76
  publication-title: IEEE Trans. Parallel Distrib. Syst.
  doi: 10.1109/TPDS.2010.62
– ident: 39_CR17
– ident: 39_CR137
– ident: 39_CR168
  doi: 10.1007/978-3-540-89740-8_2
– volume: 43
  start-page: 752
  year: 2015
  ident: 39_CR93
  publication-title: Int. J. Parallel Programm.
  doi: 10.1007/s10766-014-0320-y
– ident: 39_CR75
  doi: 10.1145/1513895.1513902
– ident: 39_CR205
  doi: 10.1109/IPDPS.2018.00061
– ident: 39_CR143
– ident: 39_CR2
– ident: 39_CR126
StartPage 382
SubjectTerms Algorithms
Assembly language
Communication
Computation
Computer Hardware
Computer Science
Computer Systems Organization and Communication Networks
Linear algebra
Machine learning
Parallel programming
Programmers
Software
Supercomputers
Survey Paper
Title Parallel programming models for heterogeneous many-cores: a comprehensive survey
URI https://link.springer.com/article/10.1007/s42514-020-00039-4
https://www.proquest.com/docview/2938245920
Volume 2