How to have FFT using multiple GPU streams?

Comparing the execution time of FFT with that of Conv2d, I noticed that the FFT operation is much slower than expected.

After some investigation, I noticed that Conv2d uses the FFT method and is somehow able to launch several streams.
When I called nn.functional.fft.fft2(inputs) myself, however, I was only able to launch one stream.

Is there any way to improve the FFT calls? I guess it is safe to say it should be possible. cudnn_convolution is somehow doing it.

How does the number of streams affect speed?

This is the profiling of nn.functional.conv2d

This is the profiling of nn.functional.fft.fft2

You can use torch.cuda.Stream to create custom streams. Check the docs carefully (as well as the streams part here) as I would consider it an advanced mechanism since you can easily create race conditions in your code.
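
As a minimal sketch of that approach (not necessarily what cudnn_convolution does internally), the snippet below spreads independent FFT calls over several side streams. The helper name `fft2_on_streams` and the `num_streams` parameter are made up for illustration. It assumes the inputs are independent of each other, and it falls back to a plain loop when the tensors are not on a CUDA device.

```python
import torch


def fft2_on_streams(batches, num_streams=2):
    """Run torch.fft.fft2 on each tensor, round-robin over side streams.

    Hypothetical helper for illustration; assumes the tensors in
    `batches` do not depend on each other.
    """
    # Without CUDA tensors there are no streams to use: run sequentially.
    if not torch.cuda.is_available() or not all(b.is_cuda for b in batches):
        return [torch.fft.fft2(b) for b in batches]

    main = torch.cuda.current_stream()
    streams = [torch.cuda.Stream() for _ in range(num_streams)]
    outputs = []
    for i, batch in enumerate(batches):
        side = streams[i % num_streams]
        # The input was produced on the main stream, so the side stream
        # must wait for it before reading (avoids a race condition).
        side.wait_stream(main)
        with torch.cuda.stream(side):
            out = torch.fft.fft2(batch)
        # The output was allocated on the side stream but will be used
        # on the main stream; tell the caching allocator about it.
        out.record_stream(main)
        outputs.append(out)

    # The main stream must not consume the results before they are ready.
    for side in streams:
        main.wait_stream(side)
    return outputs
```

Whether this is actually faster depends on whether a single FFT call leaves part of the GPU idle, so it is worth profiling both variants.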

Thank you so much.

How does the number of streams affect speed? I am also trying to understand which operations have the potential to benefit from multiple streams.

Would the following mindset be accurate?

A grid of blocks is dispatched for a task A on the main stream. For certain grid and block sizes, it could happen that towards the end of the computation some SMs become available while others are still busy. In those scenarios, if a second stream (S) is used to process a second task (B), those idle SMs could be allocated to start processing the operations for B, provided they do not depend on the result of A.

The mindset is generally right and yes, enough resources must be available to execute concurrently.
This presentation as well as this one could give you a good idea.
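
One way to check this on a given GPU is to time two independent tasks on one stream and then on two. The sketch below does that with wall-clock timing around explicit synchronization. The helper name `time_two_tasks` and the tensor sizes are arbitrary choices for illustration, and without a CUDA device it simply runs both tasks sequentially.

```python
import time

import torch


def time_two_tasks(task_a, task_b, use_second_stream):
    """Return the wall-clock seconds needed to run two independent tasks.

    Hypothetical helper for illustration; the tasks must not depend on
    each other's results.
    """
    cuda = torch.cuda.is_available()
    if cuda:
        torch.cuda.synchronize()  # start from an idle GPU
    start = time.perf_counter()
    if cuda and use_second_stream:
        main = torch.cuda.current_stream()
        side = torch.cuda.Stream()
        side.wait_stream(main)  # inputs created on main must be ready
        task_a()  # queued on the main stream
        with torch.cuda.stream(side):
            task_b()  # queued on the side stream, may overlap with A
        main.wait_stream(side)
    else:
        task_a()
        task_b()  # queued behind A on the same stream
    if cuda:
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return time.perf_counter() - start


device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(8, 64, 64, device=device)
y = torch.randn(8, 64, 64, device=device)

one_stream = time_two_tasks(lambda: torch.fft.fft2(x), lambda: torch.fft.fft2(y), False)
two_streams = time_two_tasks(lambda: torch.fft.fft2(x), lambda: torch.fft.fft2(y), True)
print(f"one stream: {one_stream:.6f} s, two streams: {two_streams:.6f} s")
```

If task A already fills every SM, the two timings should be close. Any gain comes from the case described above, where B can start on SMs that A has left idle. A first call also includes warm-up cost, so repeated measurements are more reliable than a single run.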

Thank you so much for all the feedback and the great resources.