Cudnn.benchmark slowing execution down

I just played around a bit with a large architecture and found that my network trains faster when I turn off cudnn.benchmark. The network has a fixed input size. I wanted to see if it is because of my architecture and tested speed with some of the provided torchvision models. For all of them, the same holds. cudnn.benchmark slows testing and training down.

Code to reproduce:

import torch
import torchvision.models as models
import timeit


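# Toggle this flag between runs to compare timings.
torch.backends.cudnn.benchmark = False

# Move the model to the GPU so it matches the CUDA input tensor.
model = models.densenet121(pretrained=True).cuda()

in_tensor = torch.empty([16, 3, 224, 224]).cuda()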

def bench():
    _ = model(in_tensor)

print(timeit.timeit("bench()", setup="from __main__ import bench", number=100))

If I run it with cudnn.benchmark = True, I measure 4.1 seconds, and with cudnn.benchmark = False, the program finishes after 3.2 seconds. That’s quite a difference. When training, the difference is even bigger.

I am running pytorch installed from conda. pytorch version is 0.4.1 with cuda 9.0 and cudnn 7.1.2 on Ubuntu 18.04. The GPU is a GTX 1080 Ti.

Am I doing something wrong? If not, then why is cudnn.benchmark enabled by default?


The first iteration should be excluded from the timing: when cudnn.benchmark is enabled, PyTorch tries all available algorithms in the first iteration, which makes that iteration very slow. Also, for proper timing you should call torch.cuda.synchronize() after your loop, because CUDA kernels are launched asynchronously and the loop can return before the GPU has finished its work.
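A minimal sketch of the timing approach described above, in case it helps. The helper name time_call, its warmup and iters parameters, and the driver compare_benchmark_modes are my own illustrative names, not part of PyTorch. The driver assumes a CUDA GPU and torchvision are available and uses randomly initialized weights, which should not matter for timing.

```python
import time


def time_call(fn, sync=lambda: None, warmup=3, iters=100):
    """Return the average seconds per call of fn, excluding warm-up calls.

    sync is called before starting and before stopping the clock so that
    queued asynchronous work (e.g. CUDA kernels) is counted correctly.
    """
    for _ in range(warmup):
        fn()  # with cudnn.benchmark = True, the algorithm search happens here
    sync()  # finish the warm-up work before starting the clock
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    sync()  # wait for all queued kernels before stopping the clock
    return (time.perf_counter() - start) / iters


def compare_benchmark_modes():
    """Hypothetical driver: time densenet121 with benchmark off and on."""
    import torch
    import torchvision.models as models

    model = models.densenet121().cuda().eval()
    in_tensor = torch.randn(16, 3, 224, 224, device="cuda")

    def forward():
        with torch.no_grad():
            model(in_tensor)

    for flag in (False, True):
        torch.backends.cudnn.benchmark = flag
        avg = time_call(forward, sync=torch.cuda.synchronize)
        print(f"cudnn.benchmark={flag}: {avg * 1000:.2f} ms per forward pass")
```

Calling compare_benchmark_modes() on a GPU machine prints one average per setting. The warm-up calls absorb the one-time algorithm search, so the two numbers should reflect steady-state speed only.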


I’m seeing this behavior as well (for all epochs), but only randomly, in about 20-30% of the runs. With cudnn.benchmark=True, each epoch of a bad run is about 8x slower than in a good run. Both runs use identical code and hardware but get different random seeds. In the bad runs the GPU is driven to 100% utilization, while in the good runs it stays around 80%. This is happening on a 1080 Ti, and I’m testing whether it also happens on other SKUs. My guess is that, for whatever reason, cuDNN decides to try all algorithms at each epoch and drives the GPU red hot. After disabling cudnn.benchmark, we don’t see this issue.

cudnn.benchmark will try different algorithms for each input shape.
If your model expects a lot of varying input shapes, you might want to disable benchmark mode.
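One rough way to act on this advice is to count the distinct input shapes your data pipeline produces and enable benchmark mode only when there are few of them. The sketch below is only a heuristic: the function name choose_benchmark_flag and the threshold max_distinct=4 are my own assumptions, not a PyTorch API or a recommended value.

```python
def choose_benchmark_flag(input_shapes, max_distinct=4):
    """Heuristic: suggest benchmark mode only if few distinct shapes occur.

    input_shapes is an iterable of shape tuples, e.g. (16, 3, 224, 224).
    max_distinct is an assumed, arbitrary threshold; tune it for your workload.
    """
    distinct = {tuple(shape) for shape in input_shapes}
    return len(distinct) <= max_distinct


# Fixed input size: a single shape, so the one-time search is likely worth it.
fixed = [(16, 3, 224, 224)] * 100
print(choose_benchmark_flag(fixed))

# Variable-size inputs: every new shape would trigger another search.
varying = [(1, 3, 200 + i, 300 + i) for i in range(50)]
print(choose_benchmark_flag(varying))
```

The result could then be assigned with torch.backends.cudnn.benchmark = choose_benchmark_flag(shapes) before training starts.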
