Writing a CUDA extension, where to start?

I am trying to implement some operations in PyTorch on the GPU, but they are much slower than they should be, and I assume that is due to the overhead of having it all in Python.

I am working with DSP, where essentially the only transform available is the FFT. I actually need other transforms as well, such as the DCT, which I currently implement either via a linear layer or via the FFT (see torch_dct).

I am considering implementing them as CUDA kernels, though I am not sure how yet.

I started looking at Writing a Mixed C++/CUDA Extension, which I find hard to follow. I do understand some of it, but I also wonder where I could find more documentation on which function calls are available.
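For reference, the kind of skeleton that tutorial builds up looks roughly like the following; the function names here are placeholders I picked, not anything from the tutorial itself.

```cpp
// Simplified C++ side of a mixed C++/CUDA extension (placeholder names).
#include <torch/extension.h>

// Implemented in the companion .cu file, which launches the actual kernel.
torch::Tensor my_transform_cuda(torch::Tensor input);

torch::Tensor my_transform(torch::Tensor input) {
    TORCH_CHECK(input.is_cuda(), "input must be a CUDA tensor");
    return my_transform_cuda(input);
}

PYBIND11_MODULE(TORCH_EXTENSION_NAME, m) {
    m.def("my_transform", &my_transform, "my transform (CUDA)");
}
```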

Then I read CUDA C/C++ Basics. Maybe the next document to digest would be the CUDA C++ Programming Guide.

I am not sure whether I am on the right path, so I am asking for some directions.

Basically, you can call any math function in CUDA. FP32/INT32 arithmetic is reasonably fast; FP64/INT64 (the latter often used for indexing) is significantly slower.
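For instance, a minimal pointwise kernel might look like this (a hypothetical sketch, with made-up names):

```cuda
// Pointwise kernel sketch: the fp32 math functions (cosf, expf, ...)
// are available in any CUDA kernel without extra headers.
__global__ void scaled_cos_kernel(const float* __restrict__ in,
                                  float* __restrict__ out,
                                  float scale, int n) {
    // int32 indexing; int64 index arithmetic would be noticeably slower
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = scale * cosf(in[i]);  // cosf is the fp32 variant of cos
    }
}
```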

I would recommend looking at CUDA programming patterns for typical ops, e.g.

  • pointwise operations (the easiest),
  • reductions (e.g. the combination of warp shuffles and shared memory; see the sketch after this list).

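A reduction along those lines could be sketched like this (a hypothetical example, assuming the block size is a multiple of the warp size):

```cuda
// Block-wide sum reduction sketch: warp shuffles within each warp,
// shared memory to combine the per-warp partial sums.
__inline__ __device__ float warp_reduce_sum(float val) {
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down_sync(0xffffffff, val, offset);
    return val;
}

__global__ void block_reduce_sum(const float* __restrict__ in,
                                 float* out, int n) {
    __shared__ float partial[32];        // one slot per warp (max 1024 threads)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float val = (i < n) ? in[i] : 0.0f;

    val = warp_reduce_sum(val);          // reduce within each warp
    int lane = threadIdx.x % warpSize;
    int warp = threadIdx.x / warpSize;
    if (lane == 0) partial[warp] = val;  // lane 0 publishes the warp's sum
    __syncthreads();

    // the first warp reduces the per-warp partial sums
    int nwarps = blockDim.x / warpSize;
    val = (threadIdx.x < nwarps) ? partial[lane] : 0.0f;
    if (warp == 0) val = warp_reduce_sum(val);
    if (threadIdx.x == 0) atomicAdd(out, val);  // out must be zero-initialized
}
```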
The other part to know is how global/local memory and caching work.
Also look at existing code: PyTorch has implementations of many ops, as does torchvision for the detection ops.

There are also some resources looking at kernels for PyTorch, e.g. I discuss a reduction pattern on my blog. It isn’t terribly efficient for matrix-multiplication-like patterns, but it’s faster than the CPU. :slight_smile:

Best regards

Thomas


Thank you, @tom

I am wondering whether I should try to implement these transforms directly in CUDA, or just write a C++/CUDA implementation of my application with the transforms built on top of the FFT.

  • If I want to compare these transforms with the FFT, it will be challenging to come up with CUDA code as good as cuFFT’s.
  • If I base the transforms on cuFFT, development would be faster, but the comparison would be somewhat pointless, since it would be built on cuFFT itself (a sketch of this route follows below).
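For concreteness, the cuFFT-based route I have in mind would look roughly like this untested sketch: a DCT-II of length N via a 2N-point FFT of the mirrored signal (error handling omitted; the scaling would need checking against torch_dct):

```cuda
#include <cufft.h>
#include <cuComplex.h>
#include <cuda_runtime.h>

// Build the symmetric extension y = [x[0..N-1], x[N-1..0]] of length 2N.
__global__ void mirror_extend(const float* x, cufftComplex* y, int N) {
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n < N) {
        y[n]             = make_cuFloatComplex(x[n], 0.0f);
        y[2 * N - 1 - n] = make_cuFloatComplex(x[n], 0.0f);
    }
}

// X[k] = 0.5 * Re(exp(-i*pi*k/(2N)) * Y[k]) gives the (unnormalized) DCT-II.
__global__ void twiddle_real(const cufftComplex* Y, float* X, int N) {
    int k = blockIdx.x * blockDim.x + threadIdx.x;
    if (k < N) {
        float ang = -3.14159265358979f * k / (2.0f * N);
        X[k] = 0.5f * (cosf(ang) * Y[k].x - sinf(ang) * Y[k].y);
    }
}

// d_x: device input of length N, d_X: device output of length N.
void dct2_via_cufft(const float* d_x, float* d_X, int N) {
    cufftComplex* d_y;
    cudaMalloc(&d_y, sizeof(cufftComplex) * 2 * N);
    int threads = 256, blocks = (N + threads - 1) / threads;
    mirror_extend<<<blocks, threads>>>(d_x, d_y, N);

    cufftHandle plan;
    cufftPlan1d(&plan, 2 * N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_y, d_y, CUFFT_FORWARD);

    twiddle_real<<<blocks, threads>>>(d_y, d_X, N);
    cufftDestroy(plan);
    cudaFree(d_y);
}
```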

I guess I will start with the C++ code in the ATen library.

You could compare your custom CUDA extension with the CPU version, which would not use cuFFT and should thus be a proper reference. :wink:

I don’t think that would be a fair comparison.

I’m not thinking about performance, but about functionality.
If you are certain that your code works perfectly fine, you won’t need to compare it to the CPU outputs of course. :wink: