High Performance Parallelism Pearls Volume Two: Multicore and Many-Core Programming Approaches
High Performance Parallelism Pearls Volume 2 offers another set of examples that demonstrate how to leverage parallelism. Similar to Volume 1, the techniques included here explain how to use processors and coprocessors with the same programming, illustrating the most effective ways to combine Xeon P...
Saved in:
| Main Authors: | , |
|---|---|
| Format: | E-book |
| Language: | English |
| Published: | Chantilly : Elsevier Science & Technology, 2015 |
| Edition: | 1 |
| Subjects: | |
| ISBN: | 0128038195, 9780128038192 |
| Online Access: | Get full text |
Table of Contents:
- Front Cover -- High Performance Parallelism Pearls: Multicore and Many-core Programming Approaches -- Copyright -- Contents -- Contributors -- Acknowledgments -- Foreword -- Making a bet on many-core -- 2013 Stampede-Intel Many-Core System - A First -- HPC journey and revelation -- Stampede users discover: It's parallel programming -- This book is timely and important -- Preface -- Inspired by 61 cores: A new era in programming -- Chapter 1: Introduction -- Applications and techniques -- SIMD and vectorization -- OpenMP and nested parallelism -- Latency optimizations -- Python -- Streams -- Ray tracing -- Tuning prefetching -- MPI shared memory -- Using every last core -- OpenCL vs. OpenMP -- Power analysis for nodes and clusters -- The future of many-core -- Downloads -- For more information -- Chapter 2: Numerical Weather Prediction Optimization -- Numerical weather prediction: Background and motivation -- WSM6 in the NIM -- Shared-memory parallelism and controlling horizontal vector length -- Array alignment -- Loop restructuring -- Compile-time constants for loop and array bounds -- Performance improvements -- Summary -- For more information -- Chapter 3: WRF Goddard Microphysics Scheme Optimization -- The motivation and background -- WRF Goddard microphysics scheme -- Goddard microphysics scheme -- Benchmark setup -- Code optimization -- Removal of the vertical dimension from temporary variables for a reduced memory footprint -- Collapse i- and j-loops into smaller cells for smaller footprint per thread -- Addition of vector alignment directives -- Summary of the code optimizations -- Analysis using an instruction mix report -- VTune performance metrics -- Performance effects of the optimization of Goddard microphysics scheme on the WRF -- Summary -- Acknowledgments -- For more information
- Chapter 4: Pairwise DNA Sequence Alignment Optimization -- Pairwise sequence alignment -- Parallelization on a single coprocessor -- Multi-threading using OpenMP -- Vectorization using SIMD intrinsics -- Parallelization across multiple coprocessors using MPI -- Performance results -- Summary -- For more information -- Chapter 5: Accelerated Structural Bioinformatics for Drug Discovery -- Parallelism enables proteome-scale structural bioinformatics -- Overview of eFindSite -- Benchmarking dataset -- Code profiling -- Porting eFindSite for coprocessor offload -- Parallel version for a multicore processor -- Task-level scheduling for processor and coprocessor -- Case study -- Summary -- For more information -- Chapter 6: Amber PME Molecular Dynamics Optimization -- Theory of MD -- Acceleration of neighbor list building using the coprocessor -- Acceleration of direct space sum using the coprocessor -- Additional optimizations in coprocessor code -- Removing locks whenever possible -- Exclusion list optimization -- Reduce data transfer and computation in offload code -- Modification of load balance algorithm -- PME direct space sum and neighbor list work -- PME reciprocal space sum work -- Bonded force work -- Compiler optimization flags -- Results -- Conclusions -- For more information -- Chapter 7: Low-Latency Solutions for Financial Services Applications -- Introduction -- The opportunity -- Packet processing architecture -- The symmetric communication interface -- Memory registration -- Mapping remote memory via scif_mmap() -- Optimizing packet processing on the coprocessor -- Optimization #1: The right API for the job -- Optimization #2: Benefit from write combining (WC) memory type -- Optimization #3: "Pushing" versus "pulling" data -- Optimization #4: "Shadow" pointers for efficient FIFO management -- Optimization #5: Tickless kernels
- Optimization #6: Single thread affinity and CPU "isolation" -- Optimization #7: Miscellaneous optimizations -- Results -- Conclusions -- For more information -- Chapter 8: Parallel Numerical Methods in Finance -- Overview -- Introduction -- Pricing equation for American option -- Initial C/C++ implementation -- Scalar optimization: Your best first step -- Compiler invocation switches -- Microarchitecture specification -- Floating point numeric operation control -- Transcendental functions -- Identify special cases to avoid unnecessary function call -- Use the correct parameter types -- Reuse as much as possible and reinvent as little as possible -- Subexpression evaluation -- SIMD parallelism-Vectorization -- Define and use vector data -- Vector arithmetic operations -- Vector function call -- Branch statements -- Calling the vector version and the scalar version of the program -- Loading to and storing from vector registers -- Calling vector version of the program -- Comparing vector and scalar version -- Vectorization by annotating the source code: #pragma SIMD -- C/C++ vector extension versus #pragma SIMD -- Thread parallelization -- Memory allocation in NUMA system -- Thread binding and affinity interface -- Scale from multicore to many-core -- Summary -- For more information -- Chapter 9: Wilson Dslash Kernel from Lattice QCD Optimization -- The Wilson-Dslash kernel -- Performance expectations -- Refinements to the model -- Additional tricks-compression -- First implementation and performance -- Running the naive code on Intel Xeon Phi coprocessor -- Evaluation of the naive code -- Optimized code: QPhiX and QphiX-Codegen -- Data layout for vectorization -- 3.5D blocking -- Load balancing -- SMT threading -- Lattice traversal -- Code generation with QphiX-Codegen -- QphiX-codegen code structure -- Implementing the instructions -- Generating Dslash
- Prefetching -- Generating the code -- Performance results for QPhiX -- Other benefits -- The end of the road? -- For more information -- Chapter 10: Cosmic Microwave Background Analysis: Nested Parallelism in Practice -- Analyzing the CMB with Modal -- Optimization and modernization -- Splitting the loop into parallel tasks -- Introducing nested parallelism -- Nested OpenMP parallel regions -- OpenMP 4.0 teams -- Manual nesting -- Inner loop optimization -- Results -- Comparison of nested parallelism approaches -- Summary -- For more information -- Chapter 11: Visual Search Optimization -- Image-matching application -- Image acquisition and processing -- Scale-space extrema detection -- Keypoint localization -- Orientation assignment -- Keypoint descriptor -- Keypoint matching -- Applications -- Hospitality and retail industry -- Social interactions -- Surveillance -- A study of parallelism in the visual search application -- Database (DB) level parallelism -- Flann library parallelism -- Experimental evaluation -- Setup -- Database threads scaling -- Flann threads scaling -- KD-tree scaling with dbthreads -- Summary -- For more information -- Chapter 12: Radio Frequency Ray Tracing -- Background -- StingRay system architecture -- Optimization examples -- Parallel RF simulation with OpenMP -- Parallel RF visualization with ispc -- Summary -- Acknowledgments -- For more information -- Chapter 13: Exploring Use of the Reserved Core -- The Uintah computational framework -- Radiation modeling with the UCF -- Cross-compiling the UCF -- Toward demystifying the reserved core -- Exploring thread affinity patterns -- Thread placement with PThreads -- Implementing scatter affinity with PThreads -- Experimental discussion -- Machine configuration -- Simulation configuration -- Coprocessor-side results -- Host-side results -- Further analysis -- Summary
- Acknowledgments -- For more information -- Chapter 14: High Performance Python Offloading -- Background -- The pyMIC offload module -- Design of pyMIC -- The high-level interface -- The low-level interface -- Example: singular value decomposition -- GPAW -- Overview -- DFT algorithm -- Offloading -- PyFR -- Overview -- Runtime code generation -- Offloading -- Performance -- Performance of pyMIC -- GPAW -- PyFR -- Summary -- Acknowledgments -- For more information -- Chapter 15: Fast Matrix Computations on Heterogeneous Streams -- The challenge of heterogeneous computing -- Matrix multiply -- Basic matrix multiply -- Tiling for task concurrency -- Heterogeneous streaming: concurrency among computing domains -- Pipelining within a stream -- Stream concurrency within a computing domain -- Trade-offs in pipelining, tiling, and offload -- Small matrix performance -- Trade-offs in degree of tiling and number of streams -- Tiled hStreams algorithm -- The hStreams library and framework -- Features -- How it works -- Related work -- Cholesky factorization -- Performance -- LU factorization -- Continuing work on hStreams -- Acknowledgments -- Recap -- Summary -- For more information -- Tiled hStreams matrix multiplier example source -- Chapter 16: MPI-3 Shared Memory Programming Introduction -- Motivation -- MPI's interprocess shared memory extension -- When to use MPI interprocess shared memory -- 1-D ring: from MPI messaging to shared memory -- Modifying MPPTEST halo exchange to include MPI SHM -- Evaluation environment and results -- Summary -- For more information -- Chapter 17: Coarse-Grained OpenMP for Scalable Hybrid Parallelism -- Coarse-grained versus fine-grained parallelism -- Flesh on the bones: A FORTRAN "stencil-test" example -- Fine-grained OpenMP code -- Partial coarse-grained OpenMP code -- Fully coarse-grained OpenMP code
- Performance results with the stencil code

