Programming Massively Parallel Processors: A Hands-on Approach (Applications of GPU Computing Series)
Format: PDF / Kindle (mobi) / ePub
Programming Massively Parallel Processors discusses basic concepts of parallel programming and GPU architecture. "Massively parallel" refers to the use of a large number of processors to perform a set of computations in a coordinated parallel way. The book details various techniques for constructing parallel programs, and it discusses the development process, performance, floating-point formats, parallel patterns, and dynamic parallelism. The book serves as a teaching guide for courses in which parallel programming is the main topic. It builds on the basics of C programming for CUDA, a parallel programming environment supported on NVIDIA GPUs.
Composed of 12 chapters, the book begins with basic information about the GPU as a parallel computing platform. It also explains the main concepts of CUDA, data parallelism, and the importance of memory access efficiency in CUDA.
The target audience of the book is graduate and undergraduate students from all science and engineering disciplines who need information about computational thinking and parallel programming.
- Teaches computational thinking and problem-solving techniques that facilitate high-performance parallel computing.
- Utilizes CUDA (Compute Unified Device Architecture), NVIDIA's software development tool created specifically for massively parallel environments.
- Shows you how to achieve both high performance and high reliability using the CUDA programming model as well as OpenCL.
presented in Chapter 19. The amount of effort needed to port an application to MPI, however, can be quite high because there is no shared memory across computing nodes. The programmer needs to perform domain decomposition to partition the input and output data among the cluster nodes. Based on the domain decomposition, the programmer also needs to call message sending and receiving functions to manage the data exchange between nodes. CUDA, on the other hand, provides shared memory for parallel
to provide more control of memory management and allow copy operations to be initiated earlier and overlapped with other computations (although overlapped copies can be achieved through other means). When an array_view overlays storage on the host but is accessed on the accelerator, the data is copied to an unnamed array on that accelerator and the access is made to that array. This copy of the host data may persist for the remainder of the lifetime of the array_view. This allows the C++ AMP
are local to that block and should not be passed to child/parent kernels. Event handles are not guaranteed to be unique between blocks, so using an event handle within a block that did not allocate it will result in undefined behavior.

Streams

Both named and unnamed (NULL) streams are available under dynamic parallelism. Named streams may be used by any thread within a thread block, but stream handles should not be passed to other blocks or child/parent kernels. In other words, a stream should
to Work Efficiency in Parallel Algorithms

Chapter Outline
9.1 Background
9.2 A Simple Parallel Scan
9.3 Work Efficiency Considerations
9.4 A Work-Efficient Parallel Scan
9.5 Parallel Scan for Arbitrary-Length Inputs
9.6 Summary
9.7 Exercises
References

Our next parallel pattern is prefix sum, which is also commonly known as scan. Parallel scan is frequently used to convert seemingly sequential operations, such as resource allocation, work assignment, and polynomial evaluation, into
is a vector of N constant values. The objective is to solve for the X variable values that satisfy all the equations. An intuitive approach is to invert the matrix so that X = A⁻¹ × (−Y). This can be done through methods such as Gaussian elimination for moderate-size matrices. While it is theoretically possible to solve equations represented in sparse matrices this way, the sheer size and the number of zero elements of many sparse linear systems of equations can simply overwhelm this intuitive approach.