Hi, I am testing FP16 on FPN-RPN. My device is a V100 with CUDA 9 and cuDNN 7.1.2.
I am curious why, when I set
torch.backends.cudnn.enabled = False
the network runs significantly faster than with
torch.backends.cudnn.enabled = True.
I then measured the time of the forward and backward passes and found that the main difference is in the backward pass, so I used the PyTorch profiler to figure out where the time goes in backward. Here is a screenshot of the time-distribution comparison:
The main difference is in the convolution backward: with cudnn.enabled = True it uses CudnnConvolutionBackward, which is far slower than ThnnConvolutionBackward.
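For reference, I collected the per-op timings roughly like this (`model` and `images` are placeholders for my actual network and half-precision input batch, not real names from my project):

```python
import torch

# Profile one training step; use_cuda=True records GPU kernel times too.
with torch.autograd.profiler.profile(use_cuda=True) as prof:
    loss = model(images).sum()  # placeholder loss just to drive backward()
    loss.backward()

# Per-op time table; CudnnConvolutionBackward vs. ThnnConvolutionBackward
# shows up here when toggling torch.backends.cudnn.enabled.
print(prof)
```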
I wonder why CudnnConvolutionBackward is so much slower than ThnnConvolutionBackward on the V100 for FP16.
I am sorry I do not have a test script right now, because the original code is part of a really large project.
I am testing the FPN-RPN network for a detection task. The input size is not fixed, so if I set torch.backends.cudnn.benchmark = True, the speed is even slower: cuDNN re-runs its algorithm search for every new input shape, which introduces extra overhead every iteration (see the sketch below).
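A minimal sketch of that effect, using a single hypothetical conv layer instead of my full network:

```python
import time
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True  # autotune conv algorithms per shape

conv = nn.Conv2d(256, 256, 3, padding=1).cuda().half()

# Every new (H, W) triggers a fresh algorithm search, so with detection-style
# variable input sizes the search cost is paid on almost every iteration.
for size in [600, 608, 640, 672, 704]:
    x = torch.randn(1, 256, size, size, device='cuda', dtype=torch.float16)
    torch.cuda.synchronize()
    t0 = time.time()
    conv(x).sum().backward()
    torch.cuda.synchronize()
    print(size, time.time() - t0)
```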
When I test with FP32, everything is normal: cudnn.enabled = True makes the program faster. But when the input is FP16, cudnn.enabled = True makes it slower.
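Since I cannot share the project code, here is a minimal sketch of the comparison I am doing, again with a single conv layer standing in for the network (the layer shape and batch size are made up, not taken from my model):

```python
import time
import torch
import torch.nn as nn

def bench(dtype, cudnn_enabled, iters=50):
    # Toggle the cuDNN backend and time forward + backward for one conv.
    torch.backends.cudnn.enabled = cudnn_enabled
    conv = nn.Conv2d(256, 256, 3, padding=1).cuda().to(dtype)
    x = torch.randn(2, 256, 128, 128, device='cuda', dtype=dtype)
    conv(x).sum().backward()  # warm-up, excludes first-call overhead
    torch.cuda.synchronize()
    t0 = time.time()
    for _ in range(iters):
        conv(x).sum().backward()  # grads accumulate, harmless for timing
    torch.cuda.synchronize()
    return (time.time() - t0) / iters

for dtype in (torch.float32, torch.float16):
    for enabled in (True, False):
        print(dtype, 'cudnn=%s' % enabled, '%.4fs' % bench(dtype, enabled))
```

This is just the shape of the measurement; the numbers in the screenshot come from the full FPN-RPN.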