Hardware transactional memory for GPU architectures

Bibliographic Details
Published in: MICRO 44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, December 4-7, 2011, Porto Alegre, RS, Brazil, pp. 296-307
Main Authors: Fung, Wilson W. L., Singh, Inderpreet, Brownsword, Andrew, Aamodt, Tor M.
Format: Conference Proceeding
Language: English
Published: ACM, 01.12.2011
Subjects: Computer architecture; Design; Graphics processing units; Hardware; Instruction sets; Programming; Synchronization; System recovery
Abstract Graphics processor units (GPUs) are designed to efficiently exploit thread level parallelism (TLP), multiplexing execution of 1000s of concurrent threads on a relatively smaller set of single-instruction, multiple-thread (SIMT) cores to hide various long latency operations. While threads within a CUDA block/OpenCL workgroup can communicate efficiently through an intra-core scratchpad memory, threads in different blocks can only communicate via global memory accesses. Programmers wishing to exploit such communication have to consider data-races that may occur when multiple threads modify the same memory location. Recent GPUs provide a form of inter-block communication through atomic operations for single 32-bit/64-bit words. Although fine-grained locks can be constructed from these atomic operations, synchronization using locks is prone to deadlock. In this paper, we propose to solve these problems by extending GPUs to support transactional memory (TM). Major challenges include supporting 1000s of concurrent transactions and committing non-conflicting transactions in parallel. We propose KILO TM, a novel hardware TM design for GPUs that scales to 1000s of concurrent transactions. Without cache coherency hardware to depend on, it uses word-level, value-based conflict detection to avoid broadcast communication and reduce on-chip storage overhead. It employs speculative validation using a novel Bloom filter organization to increase transaction commit parallelism. For a set of TM-enhanced GPU applications, KILO TM captures 59% of the performance of fine-grained locking, and is on average 128× faster than executing all transactions serially, for an estimated hardware area overhead of 0.5% of a commercial GPU.
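As context for the abstract's key mechanism, the following is a minimal, illustrative CUDA sketch of word-level, value-based conflict detection, written as a software routine rather than the paper's hardware commit pipeline. The names (TxLog, tx_validate_and_commit, commit_lock, TX_MAX_ENTRIES) are hypothetical, and the single global lock merely stands in for KILO TM's parallel commit units.

```cuda
#include <cuda_runtime.h>

#define TX_MAX_ENTRIES 16  // hypothetical fixed log size for the sketch

struct TxLog {
    int* read_addr[TX_MAX_ENTRIES];   // global-memory words read by the transaction
    int  read_val[TX_MAX_ENTRIES];    // values observed when they were read
    int* write_addr[TX_MAX_ENTRIES];  // words the transaction intends to update
    int  write_val[TX_MAX_ENTRIES];   // buffered new values (writes are deferred)
    int  n_reads;
    int  n_writes;
};

__device__ int commit_lock = 0;  // stand-in for KILO TM's hardware commit units

// Validate the read-set by value, then publish the write-set.
// Returns false if a conflict was detected and the transaction must
// abort and re-execute.
__device__ bool tx_validate_and_commit(TxLog* tx)
{
    // Naive global spin lock, for illustration only. On pre-Volta GPUs a
    // spin lock like this can hang when threads of the same warp contend,
    // which is exactly the kind of pitfall that motivates TM on GPUs.
    while (atomicCAS(&commit_lock, 0, 1) != 0) { }

    bool ok = true;
    for (int i = 0; i < tx->n_reads; ++i) {
        // Word-level, value-based conflict check: a conflict exists iff the
        // word no longer holds the value this transaction observed. No cache
        // coherence or ownership broadcast is needed for this test.
        int current = *((volatile int*)tx->read_addr[i]);  // bypass stale caching
        if (current != tx->read_val[i]) { ok = false; break; }
    }
    if (ok) {
        // Validation passed: publish the buffered writes.
        for (int i = 0; i < tx->n_writes; ++i) {
            *((volatile int*)tx->write_addr[i]) = tx->write_val[i];
        }
    }
    __threadfence();              // make published writes globally visible
    atomicExch(&commit_lock, 0);  // release the commit lock
    return ok;
}
```

The value comparison is what lets conflicts be detected without coherence traffic; the paper's contribution is performing this check, together with Bloom-filter-based speculative validation, in hardware at the scale of thousands of concurrent transactions.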
Author Singh, Inderpreet
Fung, Wilson W. L.
Aamodt, Tor M.
Brownsword, Andrew
Author_xml – sequence: 1
  givenname: Wilson W. L.
  surname: Fung
  fullname: Fung, Wilson W. L.
  email: wwlfung@ece.ubc.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
– sequence: 2
  givenname: Inderpreet
  surname: Singh
  fullname: Singh, Inderpreet
  email: isingh@ece.ubc.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
– sequence: 3
  givenname: Andrew
  surname: Brownsword
  fullname: Brownsword, Andrew
  email: andrew@brownsword.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
– sequence: 4
  givenname: Tor M.
  surname: Aamodt
  fullname: Aamodt, Tor M.
  email: aamodt@ece.ubc.ca
  organization: Dept. of Comput. & Electr. Eng., Univ. of British Columbia, Vancouver, BC, Canada
ContentType Conference Proceeding
DOI 10.1145/2155620.2155655
Discipline Engineering
EISBN 9781450310536
1450310532
EndPage 307
ExternalDocumentID 7851480
Genre orig-research
ISICitedReferencesCount 47
Language English
PageCount 12
PublicationDate 2011-Dec.
PublicationDateYYYYMMDD 2011-12-01
PublicationDate_xml – month: 12
  year: 2011
  text: 2011-Dec.
PublicationTitle MICRO 44: Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, December 4-7, 2011, Porto Alegre, RS, Brazil
PublicationTitleAbbrev MICRO
PublicationYear 2011
Publisher ACM
Publisher_xml – name: ACM
StartPage 296
SubjectTerms Computer architecture
Design
Graphics processing units
Hardware
Instruction sets
Programming
Synchronization
System recovery
Title Hardware transactional memory for GPU architectures
URI https://ieeexplore.ieee.org/document/7851480