How PyTorch internally launches CUDA kernels

Nvidia GPUs can only run a limited number of threads in parallel (e.g., at most 1024 threads per block on a 1080 Ti). I was wondering how PyTorch adjusts grid and block sizes to deal with this limitation when the input size exceeds the maximum parallel capacity. For example, an input of size (128, 3, 270, 360) (in NCHW) may fit into GPU memory, yet it cannot be processed entirely in parallel. I can think of splitting the large input into multiple chunks so that each chunk can be executed in parallel on the GPU, and then processing the chunks iteratively, but I believe doing this while maximizing GPU utilization is not simple. How does PyTorch deal with this issue? I would very much appreciate it if anyone could point me to the relevant source code.

Hi,

You can simply have your kernel work on more than one data point in a for-loop. Like this one, for example.
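
For reference, this is usually done with a grid-stride loop: the kernel is launched with a bounded grid, and each thread strides over the flattened input so it can cover however many elements remain. Here is a minimal sketch of that pattern (the kernel, launch function, and the 256/4096 numbers are illustrative choices, not PyTorch's actual code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

// Grid-stride loop: each thread handles elements i, i + stride, i + 2*stride, ...
// so a fixed-size grid can cover an arbitrarily large (flattened) input.
__global__ void scale_kernel(const float* in, float* out, float alpha, int64_t n) {
    int64_t stride = (int64_t)blockDim.x * gridDim.x;
    for (int64_t i = (int64_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride) {
        out[i] = alpha * in[i];
    }
}

void launch_scale(const float* in, float* out, float alpha, int64_t n) {
    const int threads = 256;  // threads per block (illustrative choice)
    // Cap the grid size; the loop above picks up whatever one pass of the grid misses.
    int64_t blocks = std::min<int64_t>((n + threads - 1) / threads, 4096);
    scale_kernel<<<(int)blocks, threads>>>(in, out, alpha, n);
}
```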

But I am not a specialist in that domain, and there are a lot of subtle things to know to do this most efficiently. I would advise you to check out a CUDA course online to learn how to design such algorithms efficiently.

From what I’ve seen, PyTorch’s “elementwise” kernels just use a hardcoded 512 threads per block (and the data is processed as a flat 1-D view). Kernels that care about particular dimensions (“vector kernels”) are mostly dispatched to libraries like cuDNN and cuBLAS, so those select block sizes internally…

Perhaps I saw 512 somewhere else; it is actually 256 threads per block. Basically, it is hardcoded here: https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Loops.cuh#L6
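
In other words, the elementwise launch boils down to something like the sketch below: treat the tensor as a flat 1-D array, use the hardcoded thread count, and derive the grid size from the element count (names are illustrative, not the actual ATen code, and the real kernels also vectorize loads):

```cpp
#include <cstdint>

constexpr int num_threads = 256;  // hardcoded threads per block (as in Loops.cuh)

// The tensor is treated as a flat 1-D array of `numel` elements, and the grid
// size is simply derived from the element count (simplified; with vectorized
// loads each block may cover num_threads * vec_size elements).
inline int64_t elementwise_grid_size(int64_t numel) {
    return (numel + num_threads - 1) / num_threads;
}

// For the (128, 3, 270, 360) input above: numel = 37,324,800, which gives
// 145,800 blocks of 256 threads; the GPU scheduler then runs those blocks on
// the SMs a few at a time, so no explicit chunking of the input is needed.
```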

An example of the “vector kernel” block-size selection logic (not pretty): https://github.com/pytorch/pytorch/blob/master/aten/src/ATen/native/cuda/Reduce.cuh
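
To give a rough feel for the kind of heuristic involved there, here is a hypothetical, heavily simplified sketch (the function name and the exact rule are made up; the real logic in Reduce.cuh is considerably more involved):

```cpp
#include <cstdint>
#include <cuda_runtime.h>

// Hypothetical block-shape heuristic for a reduction kernel (illustrative
// only): spend threads along the reduced dimension first, up to a maximum
// block size, then use the remaining threads to cover several outputs per block.
inline dim3 pick_reduce_block(int64_t reduced_dim_size, int max_threads = 512) {
    int block_x = 1;
    while (block_x < reduced_dim_size && block_x < max_threads) {
        block_x *= 2;  // next power of two along the reduction axis
    }
    int block_y = max_threads / block_x;  // leftover threads cover more outputs
    return dim3(block_x, block_y);
}
```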


Thank you very much. It is extremely helpful.