Hybrid Communication with TCA and InfiniBand on a Parallel Programming Language XcalableACC for GPU Clusters

Bibliographic Details
Published in: Proceedings / IEEE International Conference on Cluster Computing, pp. 627 - 634
Main Authors: Odajima, Tetsuya, Boku, Taisuke, Hanawa, Toshihiro, Murai, Hitoshi, Nakao, Masahiro, Tabuchi, Akihiro, Sato, Mitsuhisa
Format: Conference Proceeding
Language: English
Published: IEEE, 01.09.2015
Subjects:
ISSN:1552-5244
Abstract For the execution of parallel HPC applications on GPU-ready clusters, high communication latency between GPUs on different nodes is a serious obstacle to strong scalability. To reduce inter-GPU communication latency, we proposed the Tightly Coupled Accelerator (TCA) architecture and developed the PEACH2 board as a proof-of-concept interconnection system for TCA. Although PEACH2 provides very low communication latency, its implementation on PCIe technology imposes hardware limitations; in particular, the practical number of nodes in one system, called a sub-cluster, is currently 16. Larger node counts must be connected by a conventional interconnect such as InfiniBand, so the entire network is configured as a hybrid: a global conventional network combined with a local high-speed network built on PEACH2. For ease of programming, it is desirable to drive such a complicated communication system at the library or language level, hiding the underlying hardware. In this paper, we develop a hybrid interconnection network system combining PEACH2 and InfiniBand, and implement it on XcalableACC (XACC), a high-level PGAS language for accelerated clusters. A preliminary performance evaluation confirms that on the Himeno benchmark for stencil computation, the hybrid network improves performance by up to 40% relative to MVAPICH2 with GDR on InfiniBand. Additionally, Allgather collective communication over the hybrid network improves performance by up to 50% on 8 to 16 nodes. Combining local communication, which exploits the low latency of PEACH2, with global communication, which exploits the high bandwidth and scalability of InfiniBand, improves overall performance.
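The abstract's Allgather result reflects a two-level scheme: gather within each 16-node PEACH2 sub-cluster first, then exchange between sub-clusters over InfiniBand. A minimal Python sketch of that message flow follows; the function names, the step-count model, and the use of a ring algorithm are illustrative assumptions, not the paper's actual XACC runtime code.

```python
SUBCLUSTER_SIZE = 16  # practical PEACH2 node limit cited in the abstract


def hierarchical_allgather(per_node_data, s=SUBCLUSTER_SIZE):
    """Simulate the hybrid Allgather: every node ends up with all data.

    Step 1: allgather inside each s-node sub-cluster (low-latency PEACH2).
    Step 2: sub-cluster leaders exchange their blocks (InfiniBand).
    Step 3: leaders broadcast the assembled result locally (PEACH2).
    """
    # Step 1: each sub-cluster assembles its local block.
    blocks = [per_node_data[i:i + s] for i in range(0, len(per_node_data), s)]
    # Steps 2-3: leaders exchange blocks globally, then share locally,
    # so every node holds the concatenation of all blocks.
    return [item for block in blocks for item in block]


def ring_steps_flat(n):
    """Ring Allgather over a single network: n - 1 steps, all on InfiniBand."""
    return n - 1


def ring_steps_hybrid(n, s=SUBCLUSTER_SIZE):
    """(PEACH2 steps, InfiniBand steps) for the two-level scheme."""
    leaders = (n + s - 1) // s
    return s - 1, leaders - 1


data = [f"rank{r}" for r in range(32)]
assert hierarchical_allgather(data) == data  # every node sees all ranks
assert ring_steps_flat(32) == 31             # flat ring: 31 InfiniBand steps
assert ring_steps_hybrid(32) == (15, 1)      # hybrid: only 1 global step
```

Under this model only the inter-sub-cluster exchanges cross InfiniBand, while the latency-sensitive local steps stay on PEACH2, which is the effect the abstract attributes the up-to-50% Allgather improvement to.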
Author Odajima, Tetsuya
Boku, Taisuke
Nakao, Masahiro
Murai, Hitoshi
Tabuchi, Akihiro
Hanawa, Toshihiro
Sato, Mitsuhisa
Author_xml – sequence: 1
  givenname: Tetsuya
  surname: Odajima
  fullname: Odajima, Tetsuya
  email: odajima@hpcs.cs.tsukuba.ac.jp
  organization: Grad. Sch. of Syst. & Inf. Eng., Univ. of Tsukuba, Tsukuba, Japan
– sequence: 2
  givenname: Taisuke
  surname: Boku
  fullname: Boku, Taisuke
  organization: Center for Comput. Sci., Univ. of Tsukuba, Tsukuba, Japan
– sequence: 3
  givenname: Toshihiro
  surname: Hanawa
  fullname: Hanawa, Toshihiro
  organization: Inf. Technol. Center, Univ. of Tokyo, Tokyo, Japan
– sequence: 4
  givenname: Hitoshi
  surname: Murai
  fullname: Murai, Hitoshi
– sequence: 5
  givenname: Masahiro
  surname: Nakao
  fullname: Nakao, Masahiro
– sequence: 6
  givenname: Akihiro
  surname: Tabuchi
  fullname: Tabuchi, Akihiro
  organization: Grad. Sch. of Syst. & Inf. Eng., Univ. of Tsukuba, Tsukuba, Japan
– sequence: 7
  givenname: Mitsuhisa
  surname: Sato
  fullname: Sato, Mitsuhisa
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CLUSTER.2015.112
Discipline Computer Science
EISBN 9781467365987
146736598X
EndPage 634
ExternalDocumentID 7307661
Genre orig-research
ISSN 1552-5244
IsPeerReviewed false
IsScholarly true
Language English
PageCount 8
PublicationCentury 2000
PublicationDate 20150901
PublicationDecade 2010
PublicationTitle Proceedings / IEEE International Conference on Cluster Computing
PublicationTitleAbbrev CLUSTER
PublicationYear 2015
Publisher IEEE
StartPage 627
SubjectTerms Accelerator
Arrays
Communication systems
Electronics packaging
GPU Cluster
Graphics processing units
Interconnect
PGAS Language
Programming
Scalability
Tightly Coupled Accelerators
XcalableACC
Title Hybrid Communication with TCA and InfiniBand on a Parallel Programming Language XcalableACC for GPU Clusters
URI https://ieeexplore.ieee.org/document/7307661