Hardware transactional memory for GPU architectures

Bibliographic Details
Published in: MICRO 44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, December 4-7, 2011, Porto Alegre, RS, Brazil, pp. 296-307
Main Authors: Fung, Wilson W. L., Singh, Inderpreet, Brownsword, Andrew, Aamodt, Tor M.
Format: Conference Proceeding
Language: English
Published: ACM, 01.12.2011
Subjects: Computer architecture; Design; Graphics processing units; Hardware; Instruction sets; Programming; Synchronization; System recovery
Abstract Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data-races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations for single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel Bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128× faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
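As context for the abstract's key mechanism, the following is a minimal, illustrative CUDA sketch of word-level, value-based conflict detection, written as a software routine rather than the paper's hardware commit pipeline. The names (TxLog, tx_validate_and_commit, commit_lock, TX_MAX_ENTRIES) are hypothetical, and the single global lock merely stands in for KILO TM's parallel commit units.

```cuda
#include <cuda_runtime.h>

#define TX_MAX_ENTRIES 16  // hypothetical fixed log size for the sketch

struct TxLog {
    int* read_addr[TX_MAX_ENTRIES];   // global-memory words read by the transaction
    int  read_val[TX_MAX_ENTRIES];    // values observed when they were read
    int* write_addr[TX_MAX_ENTRIES];  // words the transaction intends to update
    int  write_val[TX_MAX_ENTRIES];   // buffered new values (writes are deferred)
    int  n_reads;
    int  n_writes;
};

__device__ int commit_lock = 0;  // stand-in for KILO TM's hardware commit units

// Validate the read-set by value, then publish the write-set.
// Returns false if a conflict was detected and the transaction must
// abort and re-execute.
__device__ bool tx_validate_and_commit(TxLog* tx)
{
    // Naive global spin lock, for illustration only. On pre-Volta GPUs a
    // spin lock like this can hang when threads of the same warp contend,
    // which is exactly the kind of pitfall that motivates TM on GPUs.
    while (atomicCAS(&commit_lock, 0, 1) != 0) { }

    bool ok = true;
    for (int i = 0; i < tx->n_reads; ++i) {
        // Word-level, value-based conflict check: a conflict exists iff the
        // word no longer holds the value this transaction observed. No cache
        // coherence or ownership broadcast is needed for this test.
        int current = *((volatile int*)tx->read_addr[i]);  // bypass stale caching
        if (current != tx->read_val[i]) { ok = false; break; }
    }
    if (ok) {
        // Validation passed: publish the buffered writes.
        for (int i = 0; i < tx->n_writes; ++i) {
            *((volatile int*)tx->write_addr[i]) = tx->write_val[i];
        }
    }
    __threadfence();              // make published writes globally visible
    atomicExch(&commit_lock, 0);  // release the commit lock
    return ok;
}
```

The value comparison is what lets conflicts be detected without coherence traffic; the paper's contribution is performing this check, together with Bloom-filter-based speculative validation, in hardware at the scale of thousands of concurrent transactions.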
Author Singh, Inderpreet
Fung, Wilson W. L.
Aamodt, Tor M.
Brownsword, Andrew
Author_xml – sequence: 1
  givenname: Wilson W. L.
  surname: Fung
  fullname: Fung, Wilson W. L.
  email: wwlfung@ece.ubc.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
– sequence: 2
  givenname: Inderpreet
  surname: Singh
  fullname: Singh, Inderpreet
  email: isingh@ece.ubc.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
– sequence: 3
  givenname: Andrew
  surname: Brownsword
  fullname: Brownsword, Andrew
  email: andrew@brownsword.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
– sequence: 4
  givenname: Tor M.
  surname: Aamodt
  fullname: Aamodt, Tor M.
  email: aamodt@ece.ubc.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
ContentType Conference Proceeding
DOI 10.1145/2155620.2155655
Discipline Engineering
EISBN 9781450310536
1450310532
EndPage 307
ExternalDocumentID 7851480
Genre orig-research
ISICitedReferencesCount 47
Language English
PageCount 12
PublicationDate 2011-Dec.
PublicationDateYYYYMMDD 2011-12-01
PublicationDate_xml – month: 12
  year: 2011
  text: 2011-Dec.
PublicationTitle MICRO 44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, December 4-7, 2011, Porto Alegre, RS, Brazil
PublicationTitleAbbrev MICRO
PublicationYear 2011
Publisher ACM
Publisher_xml – name: ACM
StartPage 296
SubjectTerms Computer architecture
Design
Graphics processing units
Hardware
Instruction sets
Programming
Synchronization
System recovery
Title Hardware transactional memory for GPU architectures
URI https://ieeexplore.ieee.org/document/7851480