Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)
Format: PDF / Kindle (mobi) / ePub
Programming Massively Parallel Processors discusses basic concepts of parallel programming and GPU architecture. "Massively parallel" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs, and also discusses the development process, performance, floating-point formats, parallel patterns, and dynamic parallelism. It serves as a teaching guide for courses in which parallel programming is the main topic, building on the basics of C programming for CUDA, a parallel programming environment supported on NVIDIA GPUs.
Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computing resource. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency in CUDA programs.
The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who need information about computational thinking and parallel programming.
- Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing.
- Utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments.
- Shows you how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.
local memory cannot be accessed by the host, and it supports shared read/write access by all work items in a work group. The private memory of OpenCL corresponds to CUDA automatic variables.

14.4 Kernel Functions

OpenCL kernels have a basic structure identical to that of CUDA kernels. All OpenCL kernel declarations start with the __kernel keyword, which is equivalent to the __global__ keyword in CUDA. Figure 14.3 shows a simple OpenCL kernel that performs vector addition.
language extensions, CUDA FORTRAN is FORTRAN with a similar set of language extensions. Before we jump into CUDA FORTRAN code, it is helpful to summarize some of the differences between these two programming interfaces to the CUDA architecture. FORTRAN is a strongly typed language, and this strong typing carries over into the CUDA FORTRAN implementation. Device data declared in CUDA FORTRAN host code is declared with the device variable attribute, unlike CUDA C where both host and device data are
function calls at runtime allows recursion and will significantly ease the burden on programmers as they transition from legacy CPU-oriented algorithms toward GPU-tuned approaches for divide-and-conquer types of computation. This also allows easier implementation of graph algorithms where data structure traversal often naturally involves recursion. In some cases, developers will be able to “cut and paste” CPU algorithms into a CUDA kernel and obtain a reasonably performing kernel, although
one thread block, 80f
thread-to-data mapping, 76–77
tiled kernel, 109–115, 110f
tiled kernel, using shared memory, 110f, 112f
warp scheduling, 90
MatrixMulKernel( ) function, 77–78
host code, 78f
Row and Col in, 79–80
small execution example of, 79f
Memory bandwidth, 3–4
Memory coalescing, 482–486, 484f
Memory models, 461–464
configurable caching and scratchpad, 463
for CUDA applications, 463
for 3D simulation models, 463–464
for enhanced atomic operations, 464
enhanced
comparators.

Figure 7.1 Excess-3 encoding, sorted by excess-3 ordering.

Figure 7.1 also shows that the all-1's pattern in the excess representation is a reserved pattern. Note that representing a 0 value together with an equal number of positive and negative values requires an odd number of patterns. Having the pattern 111 represent either a positive or a negative number would result in an unbalanced number of positive and negative values. The IEEE standard uses this special bit pattern in special ways that will be discussed