GPU Programming Projects



Getting started: vector addition


In this tutorial, you will learn how to launch a GPU kernel to perform a vector addition: c = a + b. On a CPU, this operation can be performed using a simple loop


        #pragma omp parallel for schedule(static) 
        for (int i = 0; i < n; i++) {
            c[i] = a[i] + b[i];
        }


and performance can be improved slightly with an OpenMP pragma. To execute this operation on a GPU, we will need to


  • allocate memory on the device

  • copy the vectors a and b from the host to the device

  • evaluate c = a + b on the device

  • copy the result vector c back to the host


  • Before moving any further, you should include the following headers at the top of your code

    
    #include <omp.h>
    #include <cuda.h>
    #include <cuda_runtime.h>
    

    Also, since your code will use the CUDA programming language, your source file should use a *.cu extension. You can compile your CUDA-enabled code using a Makefile like this:

    
    LIB       = -lcublas -L/usr/local/cuda/lib64/  -liomp5 -L/edfs/users/edgroup/software/intel/lib/intel64
    NVCCFLAGS = -O2 -arch sm_20 -Xcompiler -fopenmp
    EXEC      = vecadd.x
    NVCC      = nvcc
    
    $(EXEC): vecadd.cu
        $(NVCC) $(NVCCFLAGS) $(LIB) $(INCLUDE) vecadd.cu -o $(EXEC) 
    
    clean:
        rm -f *.o *.x
    



    Step 1. Allocate memory on the device


    Obviously, we'll need buffers on the device in which the vectors a, b, and c will reside. Use the cudaMalloc function to allocate GPU memory

    
        // pointers to gpu memory 
        double * aGPU;
        double * bGPU;
        double * cGPU;
    
        // allocate memory on gpu
        cudaMalloc((void**)&aGPU,n*sizeof(double));
        cudaMalloc((void**)&bGPU,n*sizeof(double));
        cudaMalloc((void**)&cGPU,n*sizeof(double));
    


    Here, n is the dimension of the vectors.
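
    In a real application, it is also good practice to check that each CUDA call succeeded. A minimal sketch that checks the return value of the first allocation above (exit() requires stdlib.h) might look like

    
        // check that the allocation succeeded
        cudaError_t err = cudaMalloc((void**)&aGPU,n*sizeof(double));
        if ( err != cudaSuccess ) {
            printf("cudaMalloc failed: %s\n",cudaGetErrorString(err));
            exit(1);
        }
    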


    Step 2. Copy data to the GPU


    In this tutorial, it is assumed that a and b were already initialized on the host in existing buffers, and the data must be transferred to the device using the cudaMemcpy call

    
        // copy data from cpu to gpu memory
        cudaMemcpy(aGPU,a,n*sizeof(double),cudaMemcpyHostToDevice);
        cudaMemcpy(bGPU,b,n*sizeof(double),cudaMemcpyHostToDevice);
    


    The enum cudaMemcpyHostToDevice specifies that data is copied from host memory to device memory. We'll do the reverse, below, using cudaMemcpyDeviceToHost.
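
    Of course, this copy assumes a and b already exist on the host. If they do not, they might be allocated and initialized before the copy with something like the following sketch (the values here are arbitrary and purely for illustration; malloc requires stdlib.h)

    
        // hypothetical host buffers with arbitrary contents
        double * a = (double*)malloc(n*sizeof(double));
        double * b = (double*)malloc(n*sizeof(double));
        double * c = (double*)malloc(n*sizeof(double));
        for (int i = 0; i < n; i++) {
            a[i] = 1.0 * i;
            b[i] = 2.0 * i;
        }
    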


    Step 3. Evaluate c = a + b on the device


    Now, we must launch a kernel to evaluate c = a + b on the device. When a kernel is launched on the device, many GPU threads execute exactly the same code simultaneously. So, one of the challenges of GPU programming is ensuring that each thread is performing whatever task it is supposed to be performing. For a simple vector addition, it makes sense to have each thread perform the addition for one element of c. A kernel launch might look like this:

    
        vecadd_gpu_bythread<<<1,n>>>(aGPU,bGPU,cGPU);
    


    Note that the kernel launch involves <<<1,n>>>. This syntax states that the GPU kernel will be evaluated on one "block" of n "threads." In general, many threads on a GPU can be lumped together into a block, and blocks can be lumped together into a "grid." The CUDA code to evaluate c = a + b, using one block and n threads, might look like this

    
    // C = A + B, evaluated on a single block with n threads
    __global__ void vecadd_gpu_bythread(double * a, double * b, double * c) {
    
        c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
    }
    


    There are two things to note here. First, the __global__ qualifier specifies that this function is a kernel to be executed on the device and called from the host. Second, the kernel is executed by all threads simultaneously, and each thread has a unique index given by the built-in variable threadIdx.x. For example, with the launch parameters <<<1,8>>>, eight threads run the kernel, and the thread with threadIdx.x = 3 computes c[3] = a[3] + b[3].

    We could also have launched a kernel that evaluates the vector sum on multiple blocks instead of multiple threads within a single block. That call might look like:

    
        vecadd_gpu_byblock<<<n,1>>>(aGPU,bGPU,cGPU);
    


    and the corresponding CUDA code is

    
    // C = A + B, evaluated on n blocks, each with a single thread
    __global__ void vecadd_gpu_byblock(double * a, double * b, double * c) {
    
        c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
    }
    


    Now, each thread can be identified by the built-in variable, blockIdx.x. Both of these strategies are severely limited in that there is an upper bound on the value of n that can be used here. The maximum number of threads per block (and other useful information) can be queried from the host using the following code:

    
        struct cudaDeviceProp cudaProp;
        int gpu_id;
        cudaGetDevice(&gpu_id);
        cudaGetDeviceProperties( &cudaProp,gpu_id );
        printf("\n");
        printf("        _________________________________________________________\n");
        printf("        CUDA device properties:\n");
        printf("        name:                 %20s\n",cudaProp.name);
        printf("        major version:        %20d\n",cudaProp.major);
        printf("        minor version:        %20d\n",cudaProp.minor);
        printf("        canMapHostMemory:     %20d\n",cudaProp.canMapHostMemory);
        printf("        totalGlobalMem:       %20lu MB\n",
          cudaProp.totalGlobalMem/(1024*1024));
        printf("        sharedMemPerBlock:    %20lu\n",cudaProp.sharedMemPerBlock);
        printf("        clockRate:            %20.3f GHz\n",
          cudaProp.clockRate/1.0e6);
        printf("        regsPerBlock:         %20d\n",cudaProp.regsPerBlock);
        printf("        warpSize:             %20d\n",cudaProp.warpSize);
        printf("        maxThreadsPerBlock:   %20d\n",cudaProp.maxThreadsPerBlock);
        printf("        _________________________________________________________\n");
        printf("\n");
    


    For an NVIDIA K40c (Kepler) GPU, the maximum number of threads per block is 1024, so the function vecadd_gpu_bythread will fail with n > 1024. Similarly, the number of blocks per dimension is 65535, so vecadd_gpu_byblock will fail with n > 65535 (we'll cover what is meant by dimensions below). We can squeeze more life out of this strategy by executing the kernel on multiple blocks with multiple threads. For best performance, the number of threads should be a multiple of the "warp size," which is 32 on a K40c GPU. For a general application, you should optimize this number for performance; here, we'll just use 32.

    
        int nthreads = 32;
        int nblocks  = (int)n/nthreads + 1;
        vecadd_gpu_by_blocks_and_threads<<<nblocks,nthreads>>>(n,aGPU,bGPU,cGPU);
    


    Note that nthreads * nblocks may actually be larger than the dimension of the array; in other words, we may be launching more threads than are necessary for the target operation. So, unlike in the previous calls, the dimension of the vectors, n, must be passed as an argument in the present call. The corresponding CUDA code might look like

    
    // C = A + B, evaluated on one or more blocks, each with multiple threads
    __global__ void vecadd_gpu_by_blocks_and_threads(int n, double * a, double * b, double * c) {
    
        int id = blockIdx.x * blockDim.x + threadIdx.x;
        if ( id >= n ) return;
    
        c[id] = a[id] + b[id];
    
    }
    


    We passed the dimension of the array to the kernel so each thread can check that its composite index (blockIdx.x * blockDim.x + threadIdx.x) falls within the bounds of the array. Note that we're using another built-in variable, blockDim.x. As with our other kernels, though, this one will eventually run out of steam for large n. Even with nblocks = 65535 and nthreads = 1024, this function, as currently written, can only accommodate a vector addition for a double-precision array, c, that is about 1/2 GB in size (65535 blocks x 1024 threads x 8 bytes per element is roughly 0.5 GB). A much more robust indexing scheme emerges when we consider a two-dimensional grid of blocks, each with multiple threads.

    
        // threads per block should be multiple of the warp
        // size (32) and has max value cudaProp.maxThreadsPerBlock
        int threads_per_block = 32;
        int maxblocks         = 65535;
    
        int nblocks_x = n / threads_per_block;
        int nblocks_y = 1;
    
        if ( n % threads_per_block != 0 ) {
           nblocks_x = (n + threads_per_block - n % threads_per_block ) / threads_per_block;
        }
    
        if (nblocks_x > maxblocks){
           nblocks_y = nblocks_x / maxblocks + 1;
           nblocks_x = nblocks_x / nblocks_y + 1;
        }
    
        // a two-dimensional grid: nblocks_x by nblocks_y
        dim3 dimgrid (nblocks_x,nblocks_y);
    
        vecadd_gpu_2d_grid<<<dimgrid,threads_per_block>>>(n,aGPU,bGPU,cGPU);
    


    The variable "dimgrid" is an integer triplet specifying the dimensions of our grid of blocks. Here, we have specified a 2-dimensional grid, so dimgrid is ( nblocks_x, nblocks_y, 1 ). While our block grid is multidimensional, we launch the kernel in a way that spawns threads in only one dimension in each block. Technically, we could have launched threads in three dimensions, but the maximum number of threads per block would still be 1024, and doing so would only complicate our indexing scheme. The relevant CUDA kernel is

    
    // C = A + B, evaluated on a 2-dimensional grid of blocks, each with 
    // at least one thread.
    __global__ void vecadd_gpu_2d_grid(int n, double * a, double * b, double * c) {
    
        int blockid = blockIdx.x*gridDim.y + blockIdx.y;
        int id      = blockid*blockDim.x + threadIdx.x;
        if ( id >= n ) return;
    
        c[id] = a[id] + b[id];
    
    }
    


    This code is really no more complicated than any of the other kernels we've encountered. The blockid identifies the block to which the current thread belongs, and the composite index involves this blockid and the thread index. Again, we must ensure that each thread's index falls within the bounds of the arrays.
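
    As a concrete example of the host-side logic above, suppose n = 100,000,000 and threads_per_block = 32. The initial nblocks_x is 3,125,000, which exceeds maxblocks, so nblocks_y becomes 48 and nblocks_x becomes 65,105. The resulting grid contains 65,105 x 48 = 3,125,040 blocks, or 100,001,280 threads in total, slightly more than n, which is exactly why the bounds check in the kernel is required.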


    Step 4. Copy data back to the CPU


    If your application requires that the result vector, c, be used on the host at some point, you must copy it back from the device

    
        // copy result back to host from device
        cudaMemcpy(c,cGPU,n*sizeof(double),cudaMemcpyDeviceToHost);
    
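
    At this point, it can be reassuring to verify the device result against the simple CPU loop from the beginning of the tutorial. A minimal check (assuming a, b, and c are the corresponding host buffers; fabs requires math.h) might look like

    
        // compare the GPU result to the CPU reference
        double max_err = 0.0;
        for (int i = 0; i < n; i++) {
            double err = fabs(c[i] - (a[i] + b[i]));
            if ( err > max_err ) max_err = err;
        }
        printf("maximum elementwise error: %20.12le\n",max_err);
    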



    Step 5. Timing comparisons


    Consider again the host code that launched the vector addition kernel using a 2-dimensional grid of blocks. This time, let's evaluate the kernel 1000 times so we can trust the timings.

    
        // threads per block should be multiple of the warp
        // size (32) and has max value cudaProp.maxThreadsPerBlock
        int threads_per_block = 32;
        int maxblocks         = 65535;
    
        long int nblocks_x = n / threads_per_block;
        long int nblocks_y = 1;
    
        if ( n % threads_per_block != 0 ) {
           nblocks_x = (n + threads_per_block - n % threads_per_block ) / threads_per_block;
        }
    
        if (nblocks_x > maxblocks){
           nblocks_y = nblocks_x / maxblocks + 1;
           nblocks_x = nblocks_x / nblocks_y + 1;
        }
    
        // a two-dimensional grid: nblocks_x by nblocks_y
        dim3 dimgrid (nblocks_x,nblocks_y);
    
        double start = omp_get_wtime();
    
        for (int i = 0; i < 1000; i++) {
            vecadd_gpu_2d_grid<<<dimgrid,threads_per_block>>>(n,aGPU,bGPU,cGPU);
        }
        // don't forget to block the CPU code from proceeding
        cudaThreadSynchronize();
    
        double end = omp_get_wtime();
        double gputime = end-start;
    


    Note that we have also included a call to cudaThreadSynchronize() after the loop where the kernel was launched. This is necessary because kernel launches are asynchronous: when a GPU kernel is launched, the CPU will proceed with its own instructions unless there is some code to serve as a barrier, and there are several ways to accomplish this. In another tutorial, we will explore the use of CUDA events to monitor GPU progress when performing work on the host and device simultaneously.
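
    For reference, one way to obtain the CPU timing used in the comparison below is to wrap the OpenMP loop from the beginning of the tutorial in the same way. A sketch (the number of OpenMP threads is set explicitly here just for illustration) might look like

    
        // time 1000 evaluations of c = a + b on the CPU
        omp_set_num_threads(6);
    
        double start = omp_get_wtime();
        for (int iter = 0; iter < 1000; iter++) {
            #pragma omp parallel for schedule(static)
            for (int i = 0; i < n; i++) {
                c[i] = a[i] + b[i];
            }
        }
        double end = omp_get_wtime();
        double cputime = end - start;
    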

    The following graph illustrates the ratio of the times required for this vector addition performed (1000 times) using the CPU code at the beginning of this tutorial (with six OpenMP threads) and the GPU code outlined just above here. Results are presented for array sizes in the range 1 ≤ n ≤ 100,000,000.