Slow pytorch on V100

I have installed PyTorch on a machine with a Tesla V100. The problem is that training is way too slow: after 16 hours, not even one epoch of ImageNet had completed! I believe this should be much faster under normal circumstances.

I have CUDA 9 installed and the driver version is 384.98 (as reported by nvidia-smi), which I think is recent enough. I also have the most recent versions of NCCL 2 and cuDNN.

What else should I check to make sure the V100 is configured properly? And how can I fix the problem?

Thank you!

Did you make sure that the GPU is being utilized?
Check nvidia-smi to see the usage

$ nvidia-smi
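
To watch utilization over time rather than from a single snapshot, the query can be scripted. This is only a minimal sketch, assuming nvidia-smi is on the PATH and that the queried fields come back as plain numbers; the helper names parse_gpu_stats and query_gpu_stats are made up for illustration.

```python
import subprocess

QUERY = "utilization.gpu,memory.used,memory.total"


def parse_gpu_stats(csv_text):
    """Parse the output of nvidia-smi --query-gpu=... --format=csv,noheader,nounits."""
    stats = []
    for line in csv_text.strip().splitlines():
        # One line per GPU, e.g. "97, 14336, 16160" (percent, MiB, MiB).
        util, used, total = (field.strip() for field in line.split(","))
        stats.append({
            "util_pct": int(util),
            "mem_used_mib": int(used),
            "mem_total_mib": int(total),
        })
    return stats


def query_gpu_stats():
    """Run nvidia-smi once and return one dict per GPU (needs an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_gpu_stats(out)

# Usage on the training machine, while the job is running:
#   print(query_gpu_stats())
```

If util_pct stays low while the job is running, the GPU is probably waiting on something else, such as data loading.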

Yes, it was being used. I used a batch size of 64, which occupied about 14 GB of GPU memory.

Does someone have any idea?

Which network are you using?


I am using a Wide ResNet. But this is a general problem: if I run ResNet-18 on CIFAR-100 on a Titan Xp, each epoch takes about 10 minutes. If I run the exact same code on a V100, it takes an hour, i.e. it is 6 times slower.

I guess there must be something wrong with my settings, but I can’t figure out what. In the meantime I have installed CUDA 9.1 and driver version 387.26, but the problem persists.

What PyTorch version are you using?


I am using pytorch 0.3. At first I built it from scratch, then removed that and installed through conda. In both cases I had similar results.

Hi, it could be that your dataloader is taking more time than the model's forward or backward pass.

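Measure the image loading time properly. Use time.perf_counter() instead of time.time(), and call torch.cuda.synchronize() before each time reading, e.g.

import time
import torch
torch.cuda.synchronize()  # drain pending GPU work before starting the clock
t0 = time.perf_counter()
output = model(input)
loss = criterion(output, target)
torch.cuda.synchronize()  # CUDA calls are asynchronous; wait for them to finish
print('Time taken', time.perf_counter() - t0)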
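
To tell whether the dataloader or the model is the bottleneck, the two can be timed separately over a few batches. A rough sketch only: profile_loop and step_fn are hypothetical names, step_fn stands in for your own forward/backward/optimizer step, and on a GPU you would pass torch.cuda.synchronize as sync.

```python
import time


def profile_loop(loader, step_fn, sync=None, max_batches=50):
    """Return (seconds spent waiting for data, seconds spent in step_fn)."""
    data_time = 0.0
    step_time = 0.0
    batches = iter(loader)
    for _ in range(max_batches):
        t0 = time.perf_counter()
        try:
            batch = next(batches)   # time spent waiting on the loader
        except StopIteration:
            break
        t1 = time.perf_counter()
        step_fn(batch)              # e.g. forward, backward, optimizer step
        if sync is not None:
            sync()                  # wait for queued GPU kernels to finish
        t2 = time.perf_counter()
        data_time += t1 - t0
        step_time += t2 - t1
    return data_time, step_time

# Hypothetical usage with an existing train_loader and train_step:
#   data_s, step_s = profile_loop(train_loader, train_step,
#                                 sync=torch.cuda.synchronize)
```

If the data time dominates, a faster GPU will not help much, and the loader (number of workers, disk speed, image decoding) is the place to look.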

It is weird indeed… I can’t think of a good explanation for this. Does the same thing happen with your other GPU applications?

Are you using torch.backends.cudnn.benchmark = True?

Yes - that improved things somewhat (2x speedup), but still not working as fast as I expected.

Hi, I also encountered the same problem. Have you already solved it?

Hi, have you solved the problem? I also encountered the same problem. My code runs quickly on a TITAN XP. After I copied it to a DGX-1 with V100s, it became very slow (about 3 times slower than on the TITAN XP). By the way, I have set torch.backends.cudnn.benchmark = True.

Hi @antspy, I have run into the same problem as you. In my case, I found that loss.backward() is 5 times slower on a V100 than on a TITAN XP. Did you solve the problem? Could you provide any suggestions? Thank you.

My environment settings are:

  1. Anaconda
  2. Python 3.6.6
  3. PyTorch 1.0
  4. CUDA 10.0
  5. cuDNN 7.4.2
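
For anyone comparing setups, here is a small sketch that prints the versions PyTorch itself reports. The conda packages bundle their own CUDA runtime and cuDNN, so these values may differ from what is installed system-wide. The helper name describe_env is made up for illustration, and the PyTorch fields are filled in only if torch can be imported.

```python
import platform


def describe_env():
    """Collect version info as seen from inside the Python process."""
    info = {"python": platform.python_version()}
    try:
        import torch
    except ImportError:
        return info  # PyTorch is not installed in this environment
    info["pytorch"] = torch.__version__
    info["cuda"] = torch.version.cuda               # CUDA version PyTorch was built with
    info["cudnn"] = torch.backends.cudnn.version()  # None if cuDNN is unavailable
    info["cudnn_benchmark"] = torch.backends.cudnn.benchmark
    if torch.cuda.is_available():
        info["gpu"] = torch.cuda.get_device_name(0)
    return info


for key, value in describe_env().items():
    print(f"{key}: {value}")
```

Posting this output from both the fast and the slow machine makes it easier to spot a mismatch.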