Hi, I am testing FP16 on an FPN_RPN model.
My device is a V100, with CUDA 9 and cuDNN 7.1.2.
I am curious why, when I set
torch.backends.cudnn.enabled = False,
the network runs significantly faster than when I set
torch.backends.cudnn.enabled = True.
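Roughly, the comparison I am running looks like the sketch below (heavily simplified: a single conv layer stands in for my actual FPN_RPN model, and the tensor shapes are just placeholders, not my real inputs):

```python
import time
import torch
import torch.nn as nn

def benchmark(use_cudnn, iters=50):
    # Toggle the cuDNN backend on/off before building the layer
    torch.backends.cudnn.enabled = use_cudnn

    # FP16 conv layer and input (placeholder sizes, not my real model)
    conv = nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda().half()
    x = torch.randn(2, 256, 128, 128, device='cuda',
                    dtype=torch.half, requires_grad=True)

    # Warm-up so lazy initialization does not pollute the timing
    for _ in range(5):
        conv(x).sum().backward()
    torch.cuda.synchronize()

    start = time.time()
    for _ in range(iters):
        conv(x).sum().backward()
    torch.cuda.synchronize()
    return (time.time() - start) / iters

print('cudnn.enabled = True :', benchmark(True))
print('cudnn.enabled = False:', benchmark(False))
```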
Then I measured the time of the forward and backward passes and found that the main difference is in the gradient backward, so I used the PyTorch profiler to figure out the details of the backward pass. Here is a screenshot of the time distribution comparison:
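For reference, the profiler output in the screenshot was collected with something like this (simplified sketch; conv and x are set up as in the snippet above, and the sort_by argument may differ slightly depending on the PyTorch version):

```python
import torch
from torch.autograd import profiler

# FP16 conv layer and input on the GPU (placeholder sizes)
conv = torch.nn.Conv2d(256, 256, kernel_size=3, padding=1).cuda().half()
x = torch.randn(2, 256, 128, 128, device='cuda',
                dtype=torch.half, requires_grad=True)

with profiler.profile(use_cuda=True) as prof:
    out = conv(x)
    out.sum().backward()

# Aggregate per-op timings to see which backward kernels dominate
print(prof.key_averages().table(sort_by='cuda_time_total'))
```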
The main difference is in the convolution backward: when cudnn.enabled is set, it uses CudnnConvolutionBackward, which is far slower than ThnnConvolutionBackward.
I wonder why CudnnConvolutionBackward is so much slower than ThnnConvolutionBackward on the V100 for FP16.