Introduction to CUDA Programming: A Beginner's Guide

This is an attempt to start a new series on CUDA programming! If you're curious about accelerating your computing tasks using GPUs, CUDA (Compute Unified Device Architecture) is an incredibly powerful tool to explore. Whether you're a developer, researcher, or hobbyist, understanding CUDA opens doors to high-performance computing.

In this series, we'll cover everything you need to know to begin leveraging CUDA effectively:

  1. Understanding CUDA: CUDA is a parallel computing platform and application programming interface model created by NVIDIA. It allows developers to use NVIDIA GPUs for general-purpose processing, which can greatly accelerate computations compared to traditional CPU-based processing.

  2. Where to Run CUDA Code: To run CUDA code, you typically need an NVIDIA GPU that supports CUDA. However, not everyone has access to such hardware. We'll explore alternatives like cloud-based services such as Lightning AI and Google Colab, which provide GPU resources for CUDA programming.

  3. Getting Started with C++ and CUDA: We'll start with the basics by writing a simple C++ program that sums two arrays on the CPU. This will serve as a foundation before moving on to GPU parallelization.

  4. Parallelizing with CUDA: You'll learn how to shift the array summation example from the CPU to the GPU using CUDA. This step-by-step process will introduce you to CUDA programming concepts such as kernel functions and memory management.

  5. Matrix Multiplication in C++ (CPU): We'll delve into matrix multiplication using the simplest algorithm on the CPU. This foundational understanding will help set the stage for optimizing with CUDA.

  6. Optimizing Matrix Multiplication with CUDA: Finally, we'll explore how CUDA can dramatically accelerate matrix multiplication. You'll see firsthand how to harness the parallel processing power of NVIDIA GPUs to achieve significant performance gains.

Whether you're aiming to accelerate scientific simulations, machine learning algorithms, or any other computationally intensive task, understanding CUDA opens up a world of possibilities. Follow along as we dive into each topic step by step, equipping you with the knowledge and skills to start leveraging CUDA effectively in your own projects.

Understanding CUDA

CUDA, short for Compute Unified Device Architecture, is a parallel computing platform and programming model developed by NVIDIA. It enables developers to harness the computational power of NVIDIA GPUs (Graphics Processing Units) for general-purpose processing tasks beyond traditional graphics rendering. CUDA allows programmers to write software that can execute on the GPU, leveraging its massively parallel architecture to accelerate applications in fields ranging from scientific computing and machine learning to computer vision and data analytics.

By offloading compute-intensive tasks to the GPU, CUDA can significantly boost performance and enable faster processing of complex algorithms compared to CPU-based solutions.

Where to run CUDA code

If you have an NVIDIA GPU in your machine, you can download and install the CUDA Toolkit by following the steps on NVIDIA's website.
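
Once the toolkit is installed, a quick way to confirm that it can see your GPU is a tiny device-query program. This is a minimal sketch (the file name check_gpu.cu is just an example); you'd compile it with nvcc check_gpu.cu -o check_gpu and run the resulting executable:

#include <iostream>
#include <cuda_runtime.h>

int main() {
  int device_count = 0;
  cudaError_t err = cudaGetDeviceCount(&device_count);
  if (err != cudaSuccess) {
    std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl;
    return 1;
  }

  std::cout << "Found " << device_count << " CUDA device(s)" << std::endl;
  for (int i = 0; i < device_count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);  // Query properties of device i
    std::cout << "Device " << i << ": " << prop.name
              << " (compute capability " << prop.major << "." << prop.minor << ")"
              << std::endl;
  }

  return 0;
}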

If, however, you don't have an NVIDIA GPU, there are a few cloud solutions you can use to write and run your CUDA code.

  • Lightning AI Studios: You get 22 free GPU hours per month on an NVIDIA T4 GPU. It also comes with a VS Code-like UI and access to a Linux terminal.

  • Google Colab: You can actually run your CUDA code on Google Colab. If you're wondering how to run C++ on Google Colab, stick around; we'll be using Colab throughout this blog post.

  • Other GPU providers: I have added a GitHub Gist containing a list of GPU providers; most of them are NOT FREE.

Getting Started with C++ and CUDA

The Google Colab notebook containing the source code is available here.

We'll start with the basics by writing a simple C++ program that sums two arrays (containing 134 million elements) on the CPU. This will serve as a foundation before moving on to GPU parallelization.

**NOTE:** If you're coding everything on your own machine, make sure you have the GNU C/C++ compiler and Clang installed, so that commands like g++ and clang are available from the command line. These should be fairly easy to set up on a Unix-like OS. On Windows, you can use something like MinGW.

If you're using Google Colab, you can follow along very easily. Once inside the Google Colab notebook, you can connect to an NVIDIA T4 GPU instance. To change your runtime type, click on "Runtime" in the toolbar and choose "Change runtime type".

There, select T4 GPU under the Hardware accelerator section and hit Save.

You can start your instance by clicking the "Connect T4" button at the top right.

Once connected, you can create a new "Code" cell and start typing. Here's a simple C++ program that creates two arrays of 134 million elements each (of type float) and sums them. This code runs on the CPU:

%%file sum_cpu.cc

#include <iostream>
#include <cmath>

void add(int n, float *x, float *y) {
  for(int i = 0; i < n; ++i) {
    y[i] = x[i] + y[i];
  }
}

int main() {
  int N = 1<<27;  // This becomes ~134M using left bitshift

  float *x = new float[N];
  float *y = new float[N];

  for(int i = 0; i < N; ++i) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  add(N, x, y);

  float max_error = 0.0f;
  for(int i = 0; i < N; ++i) {
      max_error = fmax(max_error, fabs(y[i] - 3.0f));
  }
  std::cout << "Max error: " << max_error << std::endl;

  delete [] x;
  delete [] y;

  return 0;
}

Make sure %%file filename.cc is the first line of the cell; this tells Google Colab to write the cell's contents to a file with that name on the filesystem. You can compile and run it by creating another "Code" cell and typing the following:

%%shell
clang++ sum_cpu.cc -o sum_cpu
time ./sum_cpu

The first line, %%shell, tells Google Colab to treat every line in the cell as a shell command. The line clang++ filename.cc -o filename compiles the C++ source file into an executable, and the last line, time ./filename, runs the executable while measuring how long it takes to execute.

Once you execute the code, you'll see that it reports a max error of 0, with a running time of approximately 2.6s.

Let's harness the many cores of an NVIDIA GPU using CUDA.

Parallelizing with CUDA

You'll learn how to shift the array summation example from the CPU to the GPU using CUDA. This step-by-step process will introduce you to CUDA programming concepts such as kernel functions and memory management.

We need to turn our add function into a function which can be run by the GPU, called a kernel in CUDA.

To do this, we need to add the __global__ specifier to the function. This tells the CUDA compiler that the function runs on the GPU and can be called from CPU code.

Memory allocation in CUDA

To compute on the GPU, we must allocate memory that the GPU can access. To allocate data in unified memory (a memory space accessible by both the CPU and the GPU), call cudaMallocManaged(), which gives us a pointer that can be accessed from host code (CPU) and device code (GPU). To free the data, just call cudaFree().
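
As a quick illustration (a minimal sketch, separate from the program below), here is what allocating, touching, and freeing a unified-memory buffer looks like; checking the returned cudaError_t isn't strictly required, but it's a good habit:

#include <iostream>
#include <cuda_runtime.h>

int main() {
  int N = 1 << 20;
  float *data = nullptr;

  // Allocate memory that both the CPU and the GPU can access
  cudaError_t err = cudaMallocManaged(&data, N * sizeof(float));
  if (err != cudaSuccess) {
    std::cout << "cudaMallocManaged failed: " << cudaGetErrorString(err) << std::endl;
    return 1;
  }

  // The same pointer can be used directly in host code...
  for (int i = 0; i < N; ++i) {
    data[i] = 0.0f;
  }
  // ...and passed to a kernel running on the device.

  // Release the unified memory
  cudaFree(data);

  return 0;
}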

To launch the add() kernel on the GPU, we use the triple angle bracket syntax <<< >>>.

**The CPU needs to wait until the kernel is done computing, because CUDA kernel launches don't block the calling CPU thread.** Just call cudaDeviceSynchronize() before the final error check.

So the CUDA code looks as follows; notice how the add function's declaration and invocation have changed.

%%file sum_cuda.cu

#include <iostream>
#include <cmath>

__global__
void add(int n, float *x, float *y) {
  for(int i = 0; i < n; ++i) {
    y[i] = x[i] + y[i];
  }
}

int main() {
  int N = 1<<27;  // This becomes ~134M using left bitshift
  float *x, *y;

  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for(int i = 0; i < N; ++i) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  add<<<1, 1>>>(N, x, y);

  // Wait for the GPU to finish
  cudaDeviceSynchronize();

  float max_error = 0.0f;
  for(int i = 0; i < N; ++i) {
      max_error = fmax(max_error, fabs(y[i] - 3.0f));
  }
  std::cout << "Max error: " << max_error << std::endl;

  cudaFree(x);
  cudaFree(y);

  return 0;
}

Compile and run the CUDA code using the nvcc command

%%shell
nvcc sum_cuda.cu -o sum_cuda
time ./sum_cuda

The command nvcc filename.cu -o filename invokes the NVIDIA CUDA compiler on the file, and the next line runs the executable, measuring its time.

As we can see, the execution time increased instead of decreasing. Let's profile the run with the nvprof command to see where the time is going.

%%shell
nvprof ./sum_cuda

The command nvprof ./filename profiles the executable and prints out a nice, verbose summary.

That's quite a lot to take in, but you can just focus on the line that starts with "GPU activities": the CUDA kernel is taking almost 5.8s.

Making use of threads

We ran the kernel with just one thread to do the computation. To make it parallel, we need to look at CUDA's <<<1, 1>>> syntax.

This is called the execution configuration, and it tells the CUDA runtime how many parallel threads to use on the GPU. There are two parameters here; let's start by changing the second one!

The second parameter is the number of threads in a thread block. CUDA GPUs run kernels using blocks of threads whose sizes are multiples of 32, so 256 threads is a reasonable size to choose.

With this change, we also need to update the kernel to make full use of the parallel threads. CUDA provides built-in variables that let kernels look up the indices of the running threads.

  • threadIdx.x contains the index of the current thread within its block

  • blockDim.x contains the number of threads in the block

Let's stride through the array with parallel threads

%%file sum_cuda.cu

#include <iostream>
#include <cmath>

__global__
void add(int n, float *x, float *y) {
  int index = threadIdx.x;
  int stride = blockDim.x;

  for(int i = index; i < n; i += stride) {
    y[i] = x[i] + y[i];
  }
}

int main() {
  int N = 1<<27;  // This becomes ~134M using left bitshift
  float *x, *y;

  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for(int i = 0; i < N; ++i) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  add<<<1, 256>>>(N, x, y);

  // Wait for the GPU to finish
  cudaDeviceSynchronize();

  float max_error = 0.0f;
  for(int i = 0; i < N; ++i) {
      max_error = fmax(max_error, fabs(y[i] - 3.0f));
  }
  std::cout << "Max error: " << max_error << std::endl;

  cudaFree(x);
  cudaFree(y);

  return 0;
}

%%shell
nvcc sum_cuda.cu -o sum_cuda
time ./sum_cuda

%%shell
nvprof ./sum_cuda

As you can see, increasing the thread count resulted in a massive speedup (from 5.8s to 443ms).

More info about blocks

NVIDIA GPUs have many parallel processors grouped into Streaming Multiprocessors (SMs). Each SM can run multiple concurrent thread blocks.
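
If you're curious how many SMs the GPU you're running on actually has, you can query it at runtime. This is a minimal sketch using cudaGetDeviceProperties, compiled with nvcc like the other .cu files:

#include <iostream>
#include <cuda_runtime.h>

int main() {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);  // Properties of device 0

  std::cout << "GPU: " << prop.name << std::endl;
  std::cout << "Streaming Multiprocessors: " << prop.multiProcessorCount << std::endl;
  std::cout << "Max threads per block: " << prop.maxThreadsPerBlock << std::endl;

  return 0;
}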

Let's come back to the first parameter: the number of thread blocks. Together, the blocks of parallel threads make up what CUDA calls a grid.

Since there are N elements to process and 256 threads per block, we need enough blocks to get at least N threads. Divide N by the block size, rounding up in case N is not a multiple of the block size. For N = 1<<27 and a block size of 256, that works out to 524,288 blocks.

Source: An even easier introduction to CUDA

Explanation: Approach to indexing into a 1D array

  • Each thread gets its index by computing the offset to the beginning of its block

    • Block index × Block size

    • blockIdx.x * blockDim.x

  • Adding the thread's index within the block

    • threadIdx.x

  • Set the stride to the total number of threads in the grid

    • blockDim.x * gridDim.x

    • Often called a grid-stride loop

The kernel code has to be updated to take into account the entire grid of thread blocks.

  • gridDim.x contains the number of blocks in the grid

  • blockIdx.x contains the index of the current thread block in the grid

%%file sum_cuda.cu

#include <iostream>
#include <cmath>

__global__
void add(int n, float *x, float *y) {
  int index = blockIdx.x * blockDim.x + threadIdx.x;
  int stride = blockDim.x * gridDim.x;

  for(int i = index; i < n; i += stride) {
    y[i] = x[i] + y[i];
  }
}

int main() {
  int N = 1<<27;  // This becomes ~134M using left bitshift
  float *x, *y;

  cudaMallocManaged(&x, N * sizeof(float));
  cudaMallocManaged(&y, N * sizeof(float));

  for(int i = 0; i < N; ++i) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  int block_size = 256;
  int num_blocks = (N + block_size - 1) / block_size;
  add<<<num_blocks, block_size>>>(N, x, y);

  // Wait for the GPU to finish
  cudaDeviceSynchronize();

  float max_error = 0.0f;
  for(int i = 0; i < N; ++i) {
      max_error = fmax(max_error, fabs(y[i] - 3.0f));
  }
  std::cout << "Max error: " << max_error << std::endl;

  cudaFree(x);
  cudaFree(y);

  return 0;
}

%%shell
nvcc sum_cuda.cu -o sum_cuda
time ./sum_cuda

%%shell
nvprof ./sum_cuda

Matrix Multiplication in C++ (CPU)

A common example, which I guess most people try when learning CUDA, is multiplying fairly large square matrices (1024 × 1024).

Let's write some C++ code that multiplies two square matrices of size 1024 each, using the standard matrix multiplication algorithm with \(O(n^{3})\) time complexity.
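
Concretely, each element of the output matrix is the dot product of a row of A with a column of B:

\( C_{ij} = \sum_{k=0}^{N-1} A_{ik} \, B_{kj}, \quad i, j = 0, \dots, N-1 \)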

%%file matrix_multiplication.cc

#include <iostream>
#include <random>

// Global constants
const int N = 1 << 10;  // This line of code shifts 1 to the left by 10 bits (making it 1024)

// Function declarations live here
float get_random();
void init_matrix(float matrix[][N]);
void matmul(float A[][N], float B[][N], float C[][N]);

int main() {
  float (*A)[N] = new float[N][N];
  float (*B)[N] = new float[N][N];
  float (*C)[N] = new float[N][N];

  init_matrix(A);
  init_matrix(B);

  matmul(A, B, C);

  std::cout << A[0][0] << std::endl;
  std::cout << B[0][0] << std::endl;
  std::cout << C[0][0] << std::endl;

  // Free the memory
  delete[] A;
  delete[] B;
  delete[] C;

  return 0;
}

float get_random() {
  std::random_device device;  // Get a random device, as a seed source
  std::mt19937 generator(device()); // Use the random device to seed the generator
  std::uniform_real_distribution<float> distribution(0.0f, 1.0f); // Create a distribution, range [0, 1]

  return distribution(generator); // Return the random float
}

void init_matrix(float matrix[][N]) {
  for(int row = 0; row < N; ++row) {
    for(int col = 0; col < N; ++col) {
      matrix[row][col] = get_random();
    }
  }
}

void matmul(float A[][N], float B[][N], float C[][N]) {
  for(int i = 0; i < N; ++i) {
    for(int j = 0; j < N; ++j) {
      C[i][j] = 0.0f;

      for(int k = 0; k < N; ++k) {
        C[i][j] += A[i][k] * B[k][j];
      }
    }
  }
}

%%shell
clang++ matrix_multiplication.cc -o matrix_multiplication
time ./matrix_multiplication

This took a decent chunk of time to run on the CPU; let's accelerate it using CUDA kernels.

Optimizing Matrix Multiplication with CUDA

Finally, we'll explore how CUDA can dramatically accelerate matrix multiplication. You'll see firsthand how to harness the parallel processing power of NVIDIA GPUs to achieve significant performance gains.

One way to speed up this matrix multiplication is to assign a thread to each element of the output matrix. A good way to do this is to launch a 2D grid of threads that we can index by row and column. Since the matrices are stored as flat 1D arrays in row-major order, element (row, col) lives at index row * N + col.

%%file mat_mul.cu

#include <iostream>
#include <cassert>
#include <random>

float get_random();
void init_matrix(float *matrix, int N);
__global__ void matmul(float *A, float *B, float *C, int N);

int main() {
  int N = 1 << 10;  // This value is 1024, for square matrix dimension
  size_t bytes = N * N * sizeof(float); // Bytes required for the matrix

  float *A;
  float *B;
  float *C;
  cudaMallocManaged(&A, bytes);
  cudaMallocManaged(&B, bytes);
  cudaMallocManaged(&C, bytes);

  // Fill A and B with random values on the host
  init_matrix(A, N);
  init_matrix(B, N);

  // Set our thread block and grid dimensions
  int threads = 16; // If we want 256 in the thread block, that is 16 * 16
  int blocks = (N + threads - 1) / threads;

  // dim3 describes up to 3 dimensions; the unused third one defaults to 1
  dim3 THREADS(threads, threads);
  dim3 BLOCKS(blocks, blocks);

  // Launch the CUDA kernel (execution configuration is <<<grid, block>>>)
  matmul<<<BLOCKS, THREADS>>>(A, B, C, N);

  // Kernel launches are asynchronous, so we need to sync after the kernel is run
  cudaDeviceSynchronize();

  // Free the unified memory
  cudaFree(A);
  cudaFree(B);
  cudaFree(C);

  return 0;
}

float get_random() {
  std::random_device device;  // Get a random device, as a seed source
  std::mt19937 generator(device()); // Use the random device to seed the generator
  std::uniform_real_distribution<float> distribution(0.0f, 1.0f); // Create a distribution, range [0, 1]

  return distribution(generator); // Return the random float
}

void init_matrix(float *matrix, int N) {
  for(int i = 0; i < N * N; ++i) {
    matrix[i] = get_random();
  }
}

__global__ void matmul(float *A, float *B, float *C, int N) {
  // Calculate the global row and column for each thread
  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;

  // Boundary check for our matrix
  if(row < N && col < N) {
    // Accumulate result of a row * col
    float accumulate = 0.0f;
    for(int i = 0; i < N; ++i) {
      accumulate += A[row * N + i] * B[i * N + col];
    }

    C[row * N + col] = accumulate;
  }
}

%%shell
nvcc mat_mul.cu -o mat_mul
time ./mat_mul

%%shell
nvprof ./mat_mul

As we can see, offloading the processing to CUDA yields tremendous gains, lowering our execution time from 41.6s to 190ms.
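
Unlike the array-sum examples, the matrix multiplication program above never verifies its output. A quick sanity check is to recompute a handful of elements on the CPU and compare. This is a minimal sketch; spot_check is a hypothetical helper that assumes the unified-memory matrices A, B, and C from the program above:

#include <algorithm>
#include <cmath>
#include <iostream>

// Hypothetical helper: spot-check a few elements of C against a CPU reference.
// A, B, C are the N*N row-major matrices from the program above.
void spot_check(const float *A, const float *B, const float *C, int N) {
  float max_error = 0.0f;
  for (int row = 0; row < 4; ++row) {      // Only a few rows and columns, to keep it cheap
    for (int col = 0; col < 4; ++col) {
      float expected = 0.0f;
      for (int k = 0; k < N; ++k) {
        expected += A[row * N + k] * B[k * N + col];
      }
      max_error = std::max(max_error, std::fabs(C[row * N + col] - expected));
    }
  }
  std::cout << "Max error (spot check): " << max_error << std::endl;
}

Call it right after cudaDeviceSynchronize() (before freeing the matrices); if the kernel is correct, the reported error should be tiny, though not necessarily exactly zero, since floating-point rounding can differ between the CPU and GPU.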

This article wouldn't be possible without: