Playing with CUDA block size
Recently I was working on an algorithm implementation using NVIDIA CUDA. For testing purposes I used a tiny toy data sample to check whether the algorithm worked as expected. I was focused on what I was doing, not on how it should be done, and optimization was the last thing on my mind.
After I reached the point where things worked as expected, I faced a problem: the CPU implementation was running much faster than my CUDA one. Then I suddenly realized that, while playing with the implementation, I had set the block size to 1 and the grid size to the total number of threads I wanted to run.
After I fine-tuned the kernel invocation configuration, the algorithm's execution time dropped dramatically. It became absolutely clear why there was such a performance gap with a block size of 1: each warp carries only a single active thread, so there is no SIMT advantage. Threads still run in parallel, since the GPU can execute multiple blocks/warps at the same time, but benefits such as global memory coalescing are lost.
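To illustrate the difference between the two launch configurations, here is a minimal sketch; the kernel, its name, and the element count N are hypothetical stand-ins for my actual algorithm:

```cuda
// Hypothetical kernel processing N elements, one element per thread.
__global__ void process(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i] * 2.0f;
}

// What I had: one thread per block -- 31 of every 32 warp lanes sit idle.
process<<<N, 1>>>(d_in, d_out, N);

// After tuning: full blocks, grid rounded up to cover all N elements.
const int blockSize = 256;
process<<<(N + blockSize - 1) / blockSize, blockSize>>>(d_in, d_out, N);
```

Both configurations launch the same total number of threads; only how they are grouped into blocks differs.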
Thus I decided to write a dummy program to see how the block size affects kernel execution time.
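My actual test program is not shown here, but a minimal sketch of such a benchmark could look like the following; the kernel body, data size, and block-size sweep are my own choices, and it times each launch with CUDA events:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial kernel: each thread doubles one element (no shared memory).
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 24;  // ~16M elements
    float* d;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // Sweep block sizes from 1 to 1024 and time each configuration.
    for (int block = 1; block <= 1024; block *= 2) {
        int grid = (n + block - 1) / block;
        scale<<<grid, block>>>(d, n);  // warm-up launch
        cudaEventRecord(start);
        scale<<<grid, block>>>(d, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("block size %4d: %.3f ms\n", block, ms);
    }

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```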
As we can see, starting from a block size of 32 the execution-time graph saturates and there are no further performance improvements, even as we move toward the maximum number of threads per block.
It is noticeable that the saturation point is the warp size (32 threads). Since this particular toy kernel does not use shared memory, growing the block size further brings no performance improvement.
Simply do not forget that your block size should be at least equal to the warp size. If for some reason you use shared memory in your kernel, maximizing the block size seems to be the best way to take full advantage of fast shared memory.
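As a side note, rather than hard-coding a value, the CUDA runtime can suggest a block size that maximizes occupancy for a given kernel via `cudaOccupancyMaxPotentialBlockSize`. A minimal sketch (the kernel here is again a hypothetical one of my own):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel whose launch configuration we want to tune.
__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    int minGridSize = 0, blockSize = 0;
    // Ask the runtime for the block size yielding the highest occupancy
    // (no dynamic shared memory, no block-size limit).
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
    printf("suggested block size: %d\n", blockSize);
    return 0;
}
```

The suggested value depends on the kernel's register and shared-memory usage, so it is a reasonable starting point rather than a guaranteed optimum.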