Introduction to GPGPU Computing with CUDA and C++
Introduction
Nowadays, some software programs have become so complex that they need huge amounts of computation power. With sequential programming this is a problem, but with parallel processing the speed of these programs can be increased, because parallel processing exploits the parallelism provided by multi-core architectures. So developers shift from the traditional paradigm of sequential programming toward parallel processing when there is a need for massive parallelism.
Nvidia’s graphics processing units (GPUs) have many cores and can run many threads, which makes them well suited for parallel processing. In the beginning these GPUs were used for graphical processing only, but nowadays GPUs are also popular for non-graphical parallel processing tasks, for example physics modelling, computational chemistry, GPU-based rendering and deep learning.
This difference between the CPU and the GPU exists because they are made for different tasks. CPUs are made for executing sequences of operations on a small number of much faster cores, whereas GPUs are made for parallel processing with a large number of slower cores.
What is CUDA
CUDA is a general-purpose parallel computing platform and programming model introduced by Nvidia in November 2006. CUDA stands for Compute Unified Device Architecture. It is used to develop applications for graphics processors and for general-purpose applications that are highly parallel and require many CUDA cores, such as multiplying matrices.
C++ in CUDA
CUDA comes with a software development environment that allows developers to use C/C++, Fortran and Java/Python wrappers to develop general-purpose applications with CUDA.
So C++ is just one way of developing applications using CUDA. With the powerful features and performance of the C++ language, it allows developers to build high-performance, massively parallel applications that are accelerated by the many threads running on the GPU.
Prerequisites for programming with CUDA and C++
1. C/C++ programming knowledge.
2. An Nvidia CUDA-capable GPU, or a cloud instance with a GPU.
- A laptop with a mobile Nvidia MX GPU or a dedicated Nvidia GPU (GTX 1660 Ti/Max-Q, etc.) is enough.
- You can also use GPU-enabled cloud instances on AWS, GCP or Azure.
- If you don’t have access to either a dedicated GPU or a cloud instance, you can also run CUDA applications on Google Colab using a GPU-enabled notebook. There is an awesome tutorial on how to run a CUDA application using Colab at this link.
3. An IDE for coding. I prefer to use Visual Studio.
4. The CUDA toolkit.
- I will explain how to download and install the CUDA toolkit next.
How to install the CUDA toolkit
- Find your GPU’s architecture version (if you don’t know how to find it, go to this link, follow the steps there and come back).
- Next, find the corresponding CUDA toolkit version with this link.
- Then download the CUDA toolkit with this link.
Concepts behind Host & Device
Normal C/C++ programs are developed to run on a CPU and are built with a compiler such as GCC, but CUDA programs are developed to run on both the CPU and the GPU. In a CUDA program, the normal C/C++ code can still be compiled with a traditional compiler like GCC, but the code that must run on the GPU is compiled with the Nvidia CUDA Compiler (NVCC).
Here, the device refers to the Nvidia GPU and the host refers to the CPU. So the code that runs on the CPU is called host code and the code that runs on the GPU is called device code. Likewise, GPU memory is called device memory and normal RAM is called host memory.
How the basic CUDA program works
Before writing our first CUDA program, we should have a proper understanding of how a basic CUDA program works.
1. Develop the program with kernels.
2. Decide the number of parallel CUDA threads needed.
- This is done using a three-dimensional grid. The grid has blocks and each block has threads.
3. Allocate memory on the GPU.
- To execute the CUDA code on the GPU we have to allocate memory on the GPU.
Next, we will discuss each step in more detail.
Develop the program with kernels
We will write our program with CUDA, and we will create separate functions for the CUDA code. These functions are called kernels. Kernels run in parallel on the CUDA threads that we are going to allocate in the next step.
These kernels are defined with the __global__ declaration specifier, which tells the compiler that the function runs on the device. Such functions are called from the host and executed on the device.
Example kernel function:
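For example, a minimal kernel might look like this (the function name and body are illustrative; device-side printf is used here only to show that the code runs on the GPU):

```cpp
#include <stdio.h>

// __global__ marks this as a kernel: called from the host, executed on the device
__global__ void hello_cuda()
{
    printf("Hello from the GPU!\n");
}
```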
Decide the Number of parallel cuda threads needed
To run the application in parallel we have to decide how many CUDA threads we need, and then specify this number when we call the CUDA kernel.
To allocate these CUDA threads we use a three-dimensional grid. The grid contains blocks and each block contains CUDA threads.
- Grid — Each grid contains blocks. Communication between grids is expensive.
- Block — Each block contains threads and executes on a single streaming multiprocessor, also called an SM. Threads within a block have lightweight synchronization and can also exchange data.
- Thread — Threads can execute in a parallel manner.
When deciding the number of blocks and the number of threads per block, we have to consider these facts:
- More threads per block can mean more complexity in the program, but unlike separate blocks, threads inside a block have a way to synchronize and communicate. So depending on the program we are going to develop, we have to choose the number of blocks and the number of threads per block.
After deciding the number of blocks in the grid and the number of threads in a block, we can launch our kernel from the host on the device using the execution configuration syntax.
This is the execution configuration syntax:
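In its general form (kernel_name and arguments are placeholders for your own kernel and its parameters):

```cpp
kernel_name<<<Dg, Db>>>(arguments);
```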
- Dg — Dimension and size of the grid (number of blocks in the grid).
- Db — Dimension and size of each block (number of threads in the block)
These parameters are also called kernel launch parameters.
Also, there are two more optional parameters, but because this is a basic beginner tutorial I won’t cover them here.
Since these two parameters specify dimensions, they use the dim3 integer vector type. But sometimes we don’t need 3 dimensions; when only 1 dimension is used, we can simply pass plain integers instead.
For example, let’s say we want a grid with 64 threads in total: 4 blocks, where each block contains 16 threads.
Dg = The number of blocks in the grid
Db = The number of threads per block
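In code, this could be declared as follows (a sketch; only the x dimension is used):

```cpp
dim3 Dg(4, 1, 1);   // 4 blocks in the grid
dim3 Db(16, 1, 1);  // 16 threads per block, so 4 * 16 = 64 threads in total
```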
Let’s assume we have a kernel function called hello_cuda(). We can call this function with our grid values like this:
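Assuming the Dg and Db values declared above, the launch looks like this:

```cpp
hello_cuda<<<Dg, Db>>>();
```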
Allocate memory in GPU
To pass data between the host and the device we have to allocate memory in device memory, because a kernel function’s return type is void, so we can’t return data from a kernel function directly. Instead, we use pointers: we allocate memory on both sides and copy data from one to the other through those pointers.
To handle these memory allocations and transfers, the CUDA API provides a set of functions:
- cudaMalloc(): similar to the C/C++ malloc() function; it allocates memory on the GPU.
- cudaMemcpy(): similar to the C/C++ memcpy() function; it copies data from host memory to device memory and from device memory to host memory.
- cudaFree(): similar to the C/C++ free() function; it frees memory that was allocated on the device.
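For reference, these are the signatures of the three functions in the CUDA runtime API:

```cpp
cudaError_t cudaMalloc(void** devPtr, size_t size);   // allocate size bytes of device memory
cudaError_t cudaMemcpy(void* dst, const void* src, size_t count,
                       cudaMemcpyKind kind);          // kind: cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, ...
cudaError_t cudaFree(void* devPtr);                   // free previously allocated device memory
```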
A basic CUDA program to get the sum of two numbers in Visual Studio
1. First, let’s create a Visual Studio project.
- Go to File -> New -> Project, then select Nvidia from the left-side menu.
- Select the CUDA version you installed.
- Give the project a name and location and click OK. This will create a simple CUDA program with a .cu extension.
2. Let’s remove everything inside the kernel.cu file.
3. Next, we will create a simple program to add two values.
- Add the necessary libraries.
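These are the headers the Visual Studio CUDA project template typically includes, plus stdio.h for printing:

```cpp
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
```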
- Let’s create our kernel
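A sketch of what the kernel could look like; the parameter name sumData follows the walkthrough below, while the kernel name sumKernel and the other parameter names are illustrative:

```cpp
// Runs on the device: adds the two input values and stores the result in sumData
__global__ void sumKernel(int* num1, int* num2, int* sumData)
{
    *sumData = *num1 + *num2;
}
```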
- Let’s write our main method. In it we add int variables to store the data on the host, pointers used to copy data to the device and get data back from the device, and a variable holding the size of an int, which will be used when allocating memory. (A complete sketch of the whole main function appears after these steps.)
- Let’s allocate memory on the device using cudaMalloc().
- Let’s assign some values to n1 and n2 and copy them from host to device memory. Here we use cudaMemcpy(), which takes 4 parameters:
— The first one is the destination to copy to.
— The second one is the source to copy from.
— The third one is the size of the variable being copied.
— The fourth one is the direction of the copy; because we copy from host to device, we pass cudaMemcpyHostToDevice.
- Next, let’s create dim3 values for the number of blocks and the number of threads per block, and call our kernel with those grid values.
- Then we can get the final output using our pointers. Looking at the kernel function, you can see that the sum of the two values is assigned to sumData, and when we call the kernel we pass dSum for the sumData parameter. So the result is stored at the location dSum points to in device memory. To get that value we have to copy it from device memory to the host. We can use the same cudaMemcpy() function to do this, but because we now copy from device to host, the last parameter must be cudaMemcpyDeviceToHost. We copy the value into the sum variable.
- Then we can print the value and free the device memory using the cudaFree() function.
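Putting the steps above together with the headers and kernel from earlier, a minimal sketch of the main function could look like this (the device pointer names d1 and d2 are illustrative; n1, n2, sum and dSum follow the walkthrough, and the input values are chosen so the output matches the result shown below):

```cpp
int main()
{
    // Host variables
    int n1 = 10, n2 = 15, sum = 0;  // 10 + 15 matches the "result is 25" output below

    // Device pointers
    int *d1 = nullptr, *d2 = nullptr, *dSum = nullptr;

    // Size of an int, used for the allocations and copies
    int size = sizeof(int);

    // Allocate device memory for the two inputs and the result
    cudaMalloc((void**)&d1, size);
    cudaMalloc((void**)&d2, size);
    cudaMalloc((void**)&dSum, size);

    // Copy the inputs from host memory to device memory
    cudaMemcpy(d1, &n1, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d2, &n2, size, cudaMemcpyHostToDevice);

    // One block with one thread is enough to add two numbers
    dim3 blocks(1, 1, 1);
    dim3 threads(1, 1, 1);

    // Launch the kernel on the device
    sumKernel<<<blocks, threads>>>(d1, d2, dSum);

    // Copy the result back from device memory to host memory
    cudaMemcpy(&sum, dSum, size, cudaMemcpyDeviceToHost);

    printf("result is %d\n", sum);

    // Free the device memory
    cudaFree(d1);
    cudaFree(d2);
    cudaFree(dSum);

    return 0;
}
```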
- We can then run our application in Visual Studio. In the toolbar, click on Local Windows Debugger to compile and run.
- It will output our final value like below.
result is 25
- In this example, we didn’t really use the potential of CUDA, because we only calculate two values, which can be done with a single thread. Next, let’s see how to run real parallel tasks.
Going Parallel
First, we have to understand three more concepts:
1. blockIdx.x — the index of the current block within the grid
2. threadIdx.x — the index of the current thread within its block
3. blockDim.x — the number of threads per block
- In the picture above, you can see that we have 4 blocks and each block contains 8 threads, giving 32 threads in total, one for each element of the array.
- To access each position with a unique index we can calculate it using blockIdx.x, threadIdx.x and blockDim.x
Index = threadIdx.x + blockIdx.x * blockDim.x
- Let’s see an example of how to access an element using this index.
- To access the element highlighted in red in the picture, we have to calculate its index using the formula above. Let’s calculate.
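For example, suppose the highlighted element is handled by thread 5 of block 2 (a hypothetical position, since the figure is not reproduced here). With blockDim.x = 8, the formula gives:

Index = threadIdx.x + blockIdx.x * blockDim.x = 5 + 2 * 8 = 21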
- Next, let’s see an example.
A basic CUDA program to combine two arrays element by element
1. First, we have to create a visual studio project like in the previous example.
2. Then remove everything inside the kernel.cu file.
3. Then add our libraries
4. Additionally, I will define a preprocessor macro for the number of elements in each array. This is optional, but it makes the program easier to write.
5. Let’s assume each array has 200 elements.
#define N 200
6. Let’s create our kernel. Because each array has 200 elements, we will later create a grid with 200 threads, so we can achieve maximum parallelism. Each thread will then compute one output value from the corresponding pair of array elements in parallel. Also, if there are more than 200 threads, the extra threads would access memory outside the arrays’ bounds. To avoid that we have to use a condition in the kernel.
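Here is a sketch of such a kernel, assuming the per-element calculation is a simple addition (the kernel name addArrays and the parameter names are illustrative):

```cpp
// Each thread handles one array position; the condition guards against threads
// whose index falls outside the array bounds
__global__ void addArrays(int* a, int* b, int* result)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N)
    {
        result[index] = a[index] + b[index];
    }
}
```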
7. Then we have to add values to the arrays, and for that we create a helper function. Because the function receives pointers rather than copies of the data, it can fill the arrays directly and doesn’t have to return and assign those values.
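A possible helper, using the datToArray name mentioned in the steps below (the sample values it stores are arbitrary):

```cpp
// Fills a host array with some sample values; since we receive a pointer,
// the caller's array is modified directly and nothing needs to be returned
void datToArray(int* data)
{
    for (int i = 0; i < N; i++)
    {
        data[i] = i;
    }
}
```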
8. Here is the main method. Like in the previous example, we create the main method, but this time we create only pointers.
- So we have to allocate host memory for those elements as well as device memory. Using the malloc() function in C/C++ we allocate on the host, and using cudaMalloc() we allocate on the device.
- After that, we add data to the arrays using the datToArray() function that we defined.
- Then we copy the arrays from host to device using the cudaMemcpy() function.
- Next, we create our thread grid with the number of blocks and the number of threads per block.
- Next, we call our kernel with our thread grid.
- After execution, we copy the result back from the device to the host using our pointers, like in the previous example.
- Then we can write a simple for loop to go through the final array. We could also use another kernel to print the final array values in parallel, but then the array would not be printed in order.
- Then we free the host memory and device memory.
- Here is the full code example.
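Since the original listing is not reproduced here, this is a minimal sketch of the whole program as described above, again assuming element-wise addition (the kernel name addArrays and the pointer names are illustrative; N and datToArray follow the walkthrough):

```cpp
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <stdlib.h>

#define N 200

// Each thread adds one pair of elements; the condition guards against extra threads
__global__ void addArrays(int* a, int* b, int* result)
{
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < N)
    {
        result[index] = a[index] + b[index];
    }
}

// Fills a host array with sample values
void datToArray(int* data)
{
    for (int i = 0; i < N; i++)
    {
        data[i] = i;
    }
}

int main()
{
    int size = N * sizeof(int);

    // Allocate host memory
    int* hA = (int*)malloc(size);
    int* hB = (int*)malloc(size);
    int* hResult = (int*)malloc(size);

    // Allocate device memory
    int *dA = nullptr, *dB = nullptr, *dResult = nullptr;
    cudaMalloc((void**)&dA, size);
    cudaMalloc((void**)&dB, size);
    cudaMalloc((void**)&dResult, size);

    // Fill the input arrays on the host
    datToArray(hA);
    datToArray(hB);

    // Copy the inputs from host to device
    cudaMemcpy(dA, hA, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, size, cudaMemcpyHostToDevice);

    // 10 blocks with 20 threads each gives the 200 threads we need
    dim3 blocks(10, 1, 1);
    dim3 threads(20, 1, 1);
    addArrays<<<blocks, threads>>>(dA, dB, dResult);

    // Copy the result back from device to host
    cudaMemcpy(hResult, dResult, size, cudaMemcpyDeviceToHost);

    // Print the result array in order
    for (int i = 0; i < N; i++)
    {
        printf("%d ", hResult[i]);
    }
    printf("\n");

    // Free device and host memory
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dResult);
    free(hA);
    free(hB);
    free(hResult);

    return 0;
}
```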
I really hope you’ve liked this article and I am very keen to hear your thoughts about it. Just leave a comment and I’ll be more than happy to reply.
ENJOY YOUR CODING! 🚀