Slow PyTorch on V100

I have installed PyTorch on a machine with a Tesla V100. The problem is that training is far too slow: after 16 hours, not even one epoch of ImageNet has completed! I believe this should be much faster under normal circumstances.

I have CUDA 9 installed and the driver version is 384.98 (as reported by nvidia-smi), which I think should be good enough. I also have the most recent NCCL2 and the most recent version of cuDNN.

What else should I check to make sure the V100 is configured properly? And how can I fix the problem?

Thank you!

Did you make sure that the GPU is being utilized?
Check nvidia-smi to see the usage

$ nvidia-smi
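
You can also confirm from inside PyTorch that the model is actually on the GPU; a minimal sketch (assuming model is your network, and device index 0, which is just a guess for a single-GPU machine):

import torch

print(torch.cuda.is_available())          # should be True
print(torch.cuda.get_device_name(0))      # should report the Tesla V100
print(next(model.parameters()).is_cuda)   # True if the model was moved to the GPU with .cuda()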

Yes, it was being used. I used a batch size of 64, occupying about 14 GB.

Does anyone have any ideas?

Which network are you using?

Hi,

I am using a Wide ResNet, but this is a general problem: if I run ResNet-18 on CIFAR-100 on a Titan Xp, each epoch takes about 10 minutes. If I run the exact same code on a V100, it takes an hour, i.e. it is about 6x slower.

I guess there must be something wrong with my settings, but I can’t figure out what. In the meantime I have installed CUDA 9.1 and driver version 387.26, but the problem persists.

What PyTorch version are you using?

Hi,

I am using PyTorch 0.3. At first I built it from source, then removed that and installed it through conda. In both cases I got similar results.

Hi, it could be that your dataloader is taking more time than the model's forward and backward passes.

Measure how long loading the images actually takes. Use time.perf_counter() instead of time.time(), and call torch.cuda.synchronize() before you take the time reading, e.g.:

import time
import torch

# make sure all pending GPU work is finished before starting the timer
torch.cuda.synchronize()
t0 = time.perf_counter()
output = model(input)
loss = criterion(output, target)
loss.backward()
# wait for the backward pass to finish on the GPU before reading the clock
torch.cuda.synchronize()
print('Time taken:', time.perf_counter() - t0)
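
To see whether the bottleneck is data loading rather than the GPU, a rough sketch along these lines separates the two timings (assuming model, criterion and a DataLoader called train_loader are already defined; the names are just placeholders):

import time
import torch

data_time, compute_time = 0.0, 0.0
end = time.perf_counter()
for input, target in train_loader:
    t_data = time.perf_counter()
    data_time += t_data - end                # time spent waiting for the next batch
    input, target = input.cuda(), target.cuda()
    output = model(input)
    loss = criterion(output, target)
    loss.backward()
    torch.cuda.synchronize()                 # finish GPU work before reading the clock
    end = time.perf_counter()
    compute_time += end - t_data             # forward + backward time
print('data: %.1fs, compute: %.1fs' % (data_time, compute_time))

If data_time dominates, the GPU is simply starved and more dataloader workers (or faster storage) would help more than any GPU setting.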

It is weird indeed… I can’t think of a good explanation for this. Does the same thing happen with your other GPU applications?

Are you using torch.backends.cudnn.benchmark = True?
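
For reference, the flag is set once at the top of the training script, before the training loop; as far as I know it helps most when the input sizes do not change between iterations:

import torch

# let cuDNN benchmark the available convolution algorithms and cache the fastest ones;
# most useful when input shapes stay constant across iterations
torch.backends.cudnn.benchmark = True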

Yes, that improved things somewhat (about a 2x speedup), but it is still not as fast as I expected.

Hi, I also encountered the same problem. Have you solved it yet?

Hi, have you solved the problem? I encountered the same problem. My code runs quickly on a TITAN Xp, but after I copied it to a DGX-1 (V100) it is very slow (about 3 times slower than on the TITAN Xp). By the way, I have set torch.backends.cudnn.benchmark = True.

Hi @antspy, I encountered the same problem as yours. In my case, I found that loss.backward() is 5 times slower on the V100 than on the TITAN Xp. Did you solve the problem? Could you provide any suggestions? Thank you.
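
One way to see where the time goes is to time the forward and backward passes separately, e.g. with a rough sketch like this (synchronizing between phases so each reading covers only that phase; model, criterion, input and target are assumed to be defined):

import time
import torch

torch.cuda.synchronize()
t0 = time.perf_counter()
output = model(input)
torch.cuda.synchronize()          # forward pass finished on the GPU
t1 = time.perf_counter()
loss = criterion(output, target)
loss.backward()
torch.cuda.synchronize()          # backward pass finished on the GPU
t2 = time.perf_counter()
print('forward: %.4fs, backward: %.4fs' % (t1 - t0, t2 - t1))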

My environment settings are:

  1. Anaconda
  2. Python 3.6.6
  3. PyTorch 1.0
  4. CUDA 10.0
  5. cuDNN 7.4.2
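
To double-check that this environment is really the one PyTorch uses at runtime, something like the following sketch can be printed (these attributes should be available in PyTorch 1.0):

import torch

print(torch.__version__)                    # PyTorch version
print(torch.version.cuda)                   # CUDA version PyTorch was built against
print(torch.backends.cudnn.version())       # cuDNN version actually in use
print(torch.cuda.get_device_capability(0))  # should be (7, 0) for a V100 (Volta)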