Configurable-ECC: Architecting a Flexible ECC Scheme to Support Different Sized Accesses in High Bandwidth Memory Systems

Designing error correction code (ECC) to guarantee strong reliability for high bandwidth memory (HBM) is imperative in high performance computers, especially for systems equipped with graphics processing units (GPUs). The design of ECC is challenging because future GPUs are expected to implement a m...

Detailed bibliography
Published in: IEEE Transactions on Computers, Volume 68, Issue 5, pp. 646-659
Main authors: Chen, Hsing-Min; Lee, Shin-Ying; Mudge, Trevor; Wu, Carole-Jean; Chakrabarti, Chaitali
Format: Journal Article
Language: English
Published: New York, IEEE, 2019-05-01
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects:
ISSN: 0018-9340, 1557-9956
Abstract Designing error correction code (ECC) to guarantee strong reliability for high bandwidth memory (HBM) is imperative in high performance computers, especially for systems equipped with graphics processing units (GPUs). The design of ECC is challenging because future GPUs are expected to implement a memory subsystem supporting fine and coarse-grained data accesses to match the difference in the spatial locality of GPGPU applications. Current ECC designs, however, are developed for a fixed data fetch granularity. To have a more flexible design, we propose a novel memory protection scheme, called Config(urable)-ECC, which provides strong reliability for both fine and coarse-grained data accesses. Config-ECC consists of two tiers of ECC protection. The tier-1 code is a strong product code that can correct errors due to small granularity faults and detect errors caused by large granularity faults. The tier-2 code is an XOR-based code that is employed to correct errors incurred by large granularity faults. Config-ECC provides stronger reliability and/or lower energy consumption compared to state-of-the-art fixed 32B and 64B ECC schemes. It reduces the HBM energy by 17-21 percent while reducing the failure in time (FIT) rate by 20 times compared to a state-of-the-art fixed 64B ECC scheme with an insignificant 1.2 percent performance overhead.
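The abstract says the tier-2 code is XOR-based and corrects errors that tier 1 can only detect. The following is a minimal sketch of that general idea, not the Config-ECC design itself: it assumes a plain bytewise XOR parity over equal-sized data blocks and assumes the faulty block's index is already known (for example, from a tier-1 detection step). The names `xor_blocks`, `make_parity`, and `recover_block` are hypothetical and do not come from the paper.

```python
from functools import reduce


def xor_blocks(blocks):
    """Return the bytewise XOR of a list of equal-length byte strings."""
    # zip(*blocks) walks the blocks column by column; each column is XOR-folded.
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def make_parity(data_blocks):
    """Compute one parity block over all data blocks (illustrative tier-2 code)."""
    return xor_blocks(data_blocks)


def recover_block(data_blocks, parity, bad_index):
    """Rebuild the block at bad_index from the surviving blocks and the parity.

    Assumption: bad_index was identified elsewhere (e.g. by a detection code);
    XOR parity alone cannot locate a fault, it can only repair a known one.
    """
    survivors = [b for i, b in enumerate(data_blocks) if i != bad_index]
    return xor_blocks(survivors + [parity])


if __name__ == "__main__":
    blocks = [bytes([1, 2, 3, 4]), bytes([5, 6, 7, 8]), bytes([9, 10, 11, 12])]
    parity = make_parity(blocks)

    # Simulate a large-granularity fault that wipes out the whole second block.
    damaged = list(blocks)
    damaged[1] = bytes(4)

    repaired = recover_block(damaged, parity, bad_index=1)
    print(repaired == blocks[1])
```

The sketch shows why a division of labor between tiers is plausible: XOR parity is cheap to compute and can restore an entire lost block, but only once some other mechanism has flagged which block is bad.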
Author Lee, Shin-Ying
Chakrabarti, Chaitali
Mudge, Trevor
Wu, Carole-Jean
Chen, Hsing-Min
Author_xml – sequence: 1
  givenname: Hsing-Min
  orcidid: 0000-0003-2894-6503
  surname: Chen
  fullname: Chen, Hsing-Min
  email: hchen136@asu.edu
  organization: Intel Corporation, Santa Clara, CA, USA
– sequence: 2
  givenname: Shin-Ying
  orcidid: 0000-0001-6676-3121
  surname: Lee
  fullname: Lee, Shin-Ying
  email: lshinyin@asu.edu
  organization: Samsung, Austin, TX, USA
– sequence: 3
  givenname: Trevor
  surname: Mudge
  fullname: Mudge, Trevor
  email: tnm@umich.edu
  organization: Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
– sequence: 4
  givenname: Carole-Jean
  orcidid: 0000-0002-9032-7239
  surname: Wu
  fullname: Wu, Carole-Jean
  email: carole-jean.wu@asu.edu
  organization: School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA
– sequence: 5
  givenname: Chaitali
  surname: Chakrabarti
  fullname: Chakrabarti, Chaitali
  email: chaitali@asu.edu
  organization: School of Electrical, Computer and Energy Engineering, Arizona State University, Tempe, AZ, USA
CODEN ITCOB4
CitedBy_id crossref_primary_10_1007_s11227_025_07503_4
crossref_primary_10_1109_JIOT_2024_3509525
crossref_primary_10_1109_TETC_2020_2965193
crossref_primary_10_1109_TR_2023_3303189
crossref_primary_10_1109_TR_2025_3550972
crossref_primary_10_1109_TVLSI_2024_3474791
crossref_primary_10_3390_jmse9121352
crossref_primary_10_1109_TVLSI_2025_3585971
Cites_doi 10.1145/2189750.2150989
10.1109/MM.2010.103
10.1109/IISWC.2013.6704684
10.1109/IISWC.2016.7581276
10.1109/MICRO.2014.62
10.1145/2304576.2304582
10.1109/MICRO.2014.57
10.1145/2540708.2540717
10.1145/2840807
10.1109/IISWC.2010.5650274
10.1109/TCSII.2013.2291091
10.1145/2957758
10.1109/ACSSC.2003.1292358
10.1109/ISSCC.2016.7418034
10.1109/ISCA.2012.6237047
10.1109/IISWC.2014.6983054
10.1109/TCOM.1984.1096175
10.1145/272991.272995
10.1109/L-CA.2011.4
10.1109/JRPROC.1961.287814
10.1145/2485922.2485958
10.1145/2694344.2694348
10.1109/HPCA.2016.7446094
10.1145/2830772.2830799
10.1109/DSN.2004.1311885
10.1109/HPCA.2012.6168947
10.1109/IISWC.2009.5306797
10.1145/1454115.1454152
10.1109/TEST.2014.7035318
10.1145/2503210.2503243
10.1145/2366231.2337192
10.1109/HPCA.2015.7056025
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2019
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2019
DBID 97E
RIA
RIE
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
DOI 10.1109/TC.2018.2886884
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList Technology Research Database

Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Electronic Library (IEL)
  url: https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 1557-9956
EndPage 659
ExternalDocumentID 10_1109_TC_2018_2886884
8576671
Genre orig-research
GrantInformation_xml – fundername: Defense Advanced Research Projects Agency
  funderid: 10.13039/100000185
– fundername: US National Science Foundation
  grantid: CCF-1618039
ID FETCH-LOGICAL-c289t-a99d9f6f79f63bb5783c6ade8628b5b2462a2edbf11f2fde7769ba912ca183783
IEDL.DBID RIE
ISICitedReferencesCount 11
ISICitedReferencesURI http://www.webofscience.com/api/gateway?GWVersion=2&SrcApp=Summon&SrcAuth=ProQuest&DestLinkType=CitingArticles&DestApp=WOS_CPL&KeyUT=000464129300002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
ISSN 0018-9340
IngestDate Mon Jun 30 06:25:31 EDT 2025
Tue Nov 18 21:07:04 EST 2025
Sat Nov 29 01:35:40 EST 2025
Wed Aug 27 02:44:37 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 5
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c289t-a99d9f6f79f63bb5783c6ade8628b5b2462a2edbf11f2fde7769ba912ca183783
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ORCID 0000-0002-9032-7239
0000-0001-6676-3121
0000-0003-2894-6503
PQID 2206608065
PQPubID 85452
PageCount 14
ParticipantIDs crossref_primary_10_1109_TC_2018_2886884
ieee_primary_8576671
proquest_journals_2206608065
crossref_citationtrail_10_1109_TC_2018_2886884
PublicationCentury 2000
PublicationDate 2019-05-01
PublicationDateYYYYMMDD 2019-05-01
PublicationDate_xml – month: 05
  year: 2019
  text: 2019-05-01
  day: 01
PublicationDecade 2010
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on computers
PublicationTitleAbbrev TC
PublicationYear 2019
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SSID ssj0006209
Score 2.3176506
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 646
SubjectTerms 3D DRAM
Bandwidth
Computer memory
Design
Energy consumption
error control coding and GPU
Error correction
Error correction codes
Fault detection
Graphics processing units
memory reliability
Random access memory
Reliability
Three-dimensional displays
Two dimensional displays
Title Configurable-ECC: Architecting a Flexible ECC Scheme to Support Different Sized Accesses in High Bandwidth Memory Systems
URI https://ieeexplore.ieee.org/document/8576671
https://www.proquest.com/docview/2206608065
Volume 68
WOSCitedRecordID wos000464129300002&url=https%3A%2F%2Fcvtisr.summon.serialssolutions.com%2F%23%21%2Fsearch%3Fho%3Df%26include.ft.matches%3Dt%26l%3Dnull%26q%3D
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
journalDatabaseRights – providerCode: PRVIEE
  databaseName: IEEE Electronic Library (IEL)
  customDbUrl:
  eissn: 1557-9956
  dateEnd: 99991231
  omitProxy: false
  ssIdentifier: ssj0006209
  issn: 0018-9340
  databaseCode: RIE
  dateStart: 19680101
  isFulltext: true
  titleUrlDefault: https://ieeexplore.ieee.org/
  providerName: IEEE