Professional CUDA® C programming

Professional CUDA C Programming provides down-to-earth coverage of the complex topic of parallel computing, a topic increasingly essential in everyday computing. This entry-level programming book for professionals turns complex subjects into easy-to-comprehend concepts and easy-to-follow steps.

Bibliographic Details
Main Authors: Cheng, John; Grossman, Max; McKercher, Ty
Format: eBook
Language:English
Published: Hoboken: Wrox, John Wiley & Sons, Inc., 2014
Edition: 1
Series: Wrox Programmer to Programmer
ISBN: 9781118739273, 9781118739327, 1118739272, 9781118739310, 1118739329, 1118739310
Table of Contents:
  • Professional CUDA® C programming -- Credits -- About the Authors -- About the Technical Editors -- Acknowledgments -- Contents -- Chapter 1: Heterogeneous Parallel Computing with CUDA -- Chapter 2: CUDA Programming Model -- Chapter 3: CUDA Execution Model -- Chapter 4: Global Memory -- Chapter 5: Shared Memory and Constant Memory -- Chapter 6: Streams and Concurrency -- Chapter 7: Tuning Instruction-Level Primitives -- Chapter 8: GPU-Accelerated CUDA Libraries and OpenACC -- Chapter 9: Multi-GPU Programming -- Chapter 10: Implementation Considerations -- Appendix: Suggested Readings -- Index.
  • Cover -- Title Page -- Copyright -- Contents -- Chapter 1 Heterogeneous Parallel Computing with CUDA -- Parallel Computing -- Sequential and Parallel Programming -- Parallelism -- Computer Architecture -- Heterogeneous Computing -- Heterogeneous Architecture -- Paradigm of Heterogeneous Computing -- CUDA: A Platform for Heterogeneous Computing -- Hello World from GPU -- Is CUDA C Programming Difficult? -- Summary -- Chapter 2 CUDA Programming Model -- Introducing the CUDA Programming Model -- CUDA Programming Structure -- Managing Memory -- Organizing Threads -- Launching a CUDA Kernel -- Writing Your Kernel -- Verifying Your Kernel -- Handling Errors -- Compiling and Executing -- Timing Your Kernel -- Timing with CPU Timer -- Timing with nvprof -- Organizing Parallel Threads -- Indexing Matrices with Blocks and Threads -- Summing Matrices with a 2D Grid and 2D Blocks -- Summing Matrices with a 1D Grid and 1D Blocks -- Summing Matrices with a 2D Grid and 1D Blocks -- Managing Devices -- Using the Runtime API to Query GPU Information -- Determining the Best GPU -- Using nvidia-smi to Query GPU Information -- Setting Devices at Runtime -- Summary -- Chapter 3 CUDA Execution Model -- Introducing the CUDA Execution Model -- GPU Architecture Overview -- The Fermi Architecture -- The Kepler Architecture -- Profile-Driven Optimization -- Understanding the Nature of Warp Execution -- Warps and Thread Blocks -- Warp Divergence -- Resource Partitioning -- Latency Hiding -- Occupancy -- Synchronization -- Scalability -- Exposing Parallelism -- Checking Active Warps with nvprof -- Checking Memory Operations with nvprof -- Exposing More Parallelism -- Avoiding Branch Divergence -- The Parallel Reduction Problem -- Divergence in Parallel Reduction -- Improving Divergence in Parallel Reduction -- Reducing with Interleaved Pairs -- Unrolling Loops
  • Reducing with Unrolling -- Reducing with Unrolled Warps -- Reducing with Complete Unrolling -- Reducing with Template Functions -- Dynamic Parallelism -- Nested Execution -- Nested Hello World on the GPU -- Nested Reduction -- Summary -- Chapter 4 Global Memory -- Introducing the CUDA Memory Model -- Benefits of a Memory Hierarchy -- CUDA Memory Model -- Memory Management -- Memory Allocation and Deallocation -- Memory Transfer -- Pinned Memory -- Zero-Copy Memory -- Unified Virtual Addressing -- Unified Memory -- Memory Access Patterns -- Aligned and Coalesced Access -- Global Memory Reads -- Global Memory Writes -- Array of Structures versus Structure of Arrays -- Performance Tuning -- What Bandwidth Can a Kernel Achieve? -- Memory Bandwidth -- Matrix Transpose Problem -- Matrix Addition with Unified Memory -- Summary -- Chapter 5 Shared Memory and Constant Memory -- Introducing CUDA Shared Memory -- Shared Memory -- Shared Memory Allocation -- Shared Memory Banks and Access Mode -- Configuring the Amount of Shared Memory -- Synchronization -- Checking the Data Layout of Shared Memory -- Square Shared Memory -- Rectangular Shared Memory -- Reducing Global Memory Access -- Parallel Reduction with Shared Memory -- Parallel Reduction with Unrolling -- Parallel Reduction with Dynamic Shared Memory -- Effective Bandwidth -- Coalescing Global Memory Accesses -- Baseline Transpose Kernel -- Matrix Transpose with Shared Memory -- Matrix Transpose with Padded Shared Memory -- Matrix Transpose with Unrolling -- Exposing More Parallelism -- Constant Memory -- Implementing a 1D Stencil with Constant Memory -- Comparing with the Read-Only Cache -- The Warp Shuffle Instruction -- Variants of the Warp Shuffle Instruction -- Sharing Data within a Warp -- Parallel Reduction Using the Warp Shuffle Instruction -- Summary -- Chapter 6 Streams and Concurrency
  • Introducing Streams and Events -- CUDA Streams -- Stream Scheduling -- Stream Priorities -- CUDA Events -- Stream Synchronization -- Concurrent Kernel Execution -- Concurrent Kernels in Non-NULL Streams -- False Dependencies on Fermi GPUs -- Dispatching Operations with OpenMP -- Adjusting Stream Behavior Using Environment Variables -- Concurrency-Limiting GPU Resources -- Blocking Behavior of the Default Stream -- Creating Inter-Stream Dependencies -- Overlapping Kernel Execution and Data Transfer -- Overlap Using Depth-First Scheduling -- Overlap Using Breadth-First Scheduling -- Overlapping GPU and CPU Execution -- Stream Callbacks -- Summary -- Chapter 7 Tuning Instruction-Level Primitives -- Introducing CUDA Instructions -- Floating-Point Instructions -- Intrinsic and Standard Functions -- Atomic Instructions -- Optimizing Instructions for Your Application -- Single-Precision vs. Double-Precision -- Standard vs. Intrinsic Functions -- Understanding Atomic Instructions -- Bringing It All Together -- Summary -- Chapter 8 GPU-Accelerated CUDA Libraries and OpenACC -- Introducing the CUDA Libraries -- Supported Domains for CUDA Libraries -- A Common Library Workflow -- The cuSPARSE Library -- cuSPARSE Data Storage Formats -- Formatting Conversion with cuSPARSE -- Demonstrating cuSPARSE -- Important Topics in cuSPARSE Development -- cuSPARSE Summary -- The cuBLAS Library -- Managing cuBLAS Data -- Demonstrating cuBLAS -- Important Topics in cuBLAS Development -- cuBLAS Summary -- The cuFFT Library -- Using the cuFFT API -- Demonstrating cuFFT -- cuFFT Summary -- The cuRAND Library -- Choosing Pseudo- or Quasi-Random Numbers -- Overview of the cuRAND Library -- Demonstrating cuRAND -- Important Topics in cuRAND Development -- CUDA Library Features Introduced in CUDA 6 -- Drop-In CUDA Libraries -- Multi-GPU Libraries
  • A Survey of CUDA Library Performance -- cuSPARSE versus MKL -- cuBLAS versus MKL BLAS -- cuFFT versus FFTW versus MKL -- CUDA Library Performance Summary -- Using OpenACC -- Using OpenACC Compute Directives -- Using OpenACC Data Directives -- The OpenACC Runtime API -- Combining OpenACC and the CUDA Libraries -- Summary of OpenACC -- Summary -- Chapter 9 Multi-GPU Programming -- Moving to Multiple GPUs -- Executing on Multiple GPUs -- Peer-to-Peer Communication -- Synchronizing across Multi-GPUs -- Subdividing Computation across Multiple GPUs -- Allocating Memory on Multiple Devices -- Distributing Work from a Single Host Thread -- Compiling and Executing -- Peer-to-Peer Communication on Multiple GPUs -- Enabling Peer-to-Peer Access -- Peer-to-Peer Memory Copy -- Peer-to-Peer Memory Access with Unified Virtual Addressing -- Finite Difference on Multi-GPU -- Stencil Calculation for 2D Wave Equation -- Typical Patterns for Multi-GPU Programs -- 2D Stencil Computation with Multiple GPUs -- Overlapping Computation and Communication -- Compiling and Executing -- Scaling Applications across GPU Clusters -- CPU-to-CPU Data Transfer -- GPU-to-GPU Data Transfer Using Traditional MPI -- GPU-to-GPU Data Transfer with CUDA-aware MPI -- Intra-Node GPU-to-GPU Data Transfer with CUDA-Aware MPI -- Adjusting Message Chunk Size -- GPU-to-GPU Data Transfer with GPUDirect RDMA -- Summary -- Chapter 10 Implementation Considerations -- The CUDA C Development Process -- APOD Development Cycle -- Optimization Opportunities -- CUDA Code Compilation -- CUDA Error Handling -- Profile-Driven Optimization -- Finding Optimization Opportunities Using nvprof -- Guiding Optimization Using nvvp -- NVIDIA Tools Extension -- CUDA Debugging -- Kernel Debugging -- Memory Debugging -- Debugging Summary -- A Case Study in Porting C Programs to CUDA C -- Assessing crypt
  • Parallelizing crypt -- Optimizing crypt -- Deploying crypt -- Summary of Porting crypt -- Summary -- Appendix: Suggested Readings -- Index -- Advertisement -- EULA