FP16 CudnnConvolutionBackward is slower than ThnnConv2DBackward

Hi, I am testing FP16 on FPN_RPN.
My device is a V100, with CUDA 9 and cuDNN 7.1.2.

I noticed that when I set

torch.backends.cudnn.enabled = False

the network runs significantly faster than when I set

torch.backends.cudnn.enabled = True.

I then timed the forward and backward passes and found that the main difference is in the gradient backward, so I used the PyTorch profiler to look into the details. Here is a screenshot of the time distribution comparison:
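For reference, a minimal sketch of this kind of profiling (not the original FPN_RPN code; the layer sizes are made up, and it falls back to CPU/FP32 when no GPU is available):

```python
import torch

# Pick GPU + FP16 when available, otherwise CPU + FP32 so the sketch still runs.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

x = torch.randn(2, 16, 32, 32, device=device, dtype=dtype, requires_grad=True)
conv = torch.nn.Conv2d(16, 16, 3, padding=1).to(device=device, dtype=dtype)

# Profile one forward + backward; on GPU the table shows which backward op
# ran (CudnnConvolutionBackward vs ThnnConv2DBackward, depending on
# torch.backends.cudnn.enabled).
with torch.autograd.profiler.profile(use_cuda=(device == "cuda")) as prof:
    conv(x).sum().backward()

print(prof.key_averages().table(sort_by="cpu_time_total"))
```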

The main difference is in the convolution backward: when cudnn.enabled is set, it uses CudnnConvolutionBackward, which is far slower than ThnnConv2DBackward.

I wonder why CudnnConvolutionBackward is so slow for FP16 on the V100 compared with ThnnConv2DBackward.

Thanks! :grinning:

Did you try setting torch.backends.cudnn.benchmark = True?
Maybe the default algorithm is not a good fit.
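For reference, enabling the autotuner is a one-liner (a minimal sketch; with fixed input shapes it usually helps, since the algorithm search runs once per shape and is then cached):

```python
import torch

# cudnn autotuner: benchmarks the available convolution algorithms for each
# input shape it encounters and caches the fastest one. Helps when shapes
# are fixed; can hurt when shapes change every iteration.
torch.backends.cudnn.benchmark = True
```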

Also do you have a small script with random input data that would reproduce this please?

I am sorry, I do not have a test script right now because the original code is part of a really large project.

I am testing the FPN-RPN network for a detection task. The input size is not fixed, so if I set torch.backends.cudnn.benchmark = True, the speed is even worse, because the autotuner searches for algorithms on every iteration, which introduces extra overhead.
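A hypothetical illustration of that overhead (not the original project code; shapes and layer sizes are made up): with benchmark enabled, every new input shape triggers a fresh algorithm search, so a detection workload with varying input sizes pays that cost repeatedly.

```python
import time
import torch

def time_convs(shapes, benchmark):
    """Time a conv forward+backward over a list of (H, W) input shapes."""
    torch.backends.cudnn.benchmark = benchmark
    conv = torch.nn.Conv2d(3, 16, 3, padding=1).cuda()
    torch.cuda.synchronize()
    start = time.time()
    for h, w in shapes:
        x = torch.randn(1, 3, h, w, device="cuda")
        conv(x).sum().backward()
    torch.cuda.synchronize()
    return time.time() - start

if torch.cuda.is_available():
    # Every shape here is new, so benchmark=True re-runs the algorithm
    # search each iteration instead of amortizing it.
    varying = [(224 + 8 * i, 224 + 8 * i) for i in range(20)]
    print("benchmark=True :", time_convs(varying, True))
    print("benchmark=False:", time_convs(varying, False))
```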

With FP32 everything is normal: cudnn.enabled makes the program faster. But with FP16 inputs, cudnn.enabled makes it slower.

Ok,
I'm sure @ngimel is going to be interested in these observations!