I have installed PyTorch on a machine with a Tesla V100. The problem is that it is way too slow; after 16 hours, not even one epoch of ImageNet was completed! I believe this should be much faster under normal circumstances.
I have CUDA 9 installed and the driver version reported by nvidia-smi is 384.98, which I think should be good enough. I also have the most recent NCCL 2 and the most recent version of cuDNN.
What else should I check to make sure the V100 is configured properly? And how can I fix the problem?
I am using a Wide ResNet, but this is a general problem: if I run ResNet-18 on CIFAR-100 on a Titan Xp, each epoch takes about 10 minutes. If I run the exact same code on a V100, it takes an hour, i.e. it is 6x slower.
I guess there must be something wrong with my settings, but I can't figure out what. In the meantime I have installed CUDA 9.1 and driver version 387.26, but the problem persists.
Hi, it could be that your dataloader is taking more time than the model's forward or backward pass.
Check how long the image loading actually takes. To time it properly, use time.perf_counter() instead of time.time(), and call torch.cuda.synchronize() before each time reading, for example.
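For reference, a minimal timing sketch along those lines (the helper name `timed` is just illustrative; the `torch.cuda.synchronize()` calls are shown as comments because they only matter when the timed function launches CUDA kernels):

```python
import time

def timed(fn, *args, **kwargs):
    """Return (result, elapsed_seconds) for a call to fn."""
    # If fn launches CUDA kernels, uncomment the synchronize calls so the
    # timer does not stop while GPU work is still queued asynchronously:
    # torch.cuda.synchronize()
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    # torch.cuda.synchronize()
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example: time a cheap CPU-side call
result, elapsed = timed(sum, range(1_000))
print(result, elapsed)
```

Wrapping the dataloader iteration and the forward/backward steps separately with something like this should show which stage dominates the epoch time.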
Hi, have you solved the problem? I encountered the same issue. My code runs quickly on a TITAN Xp, but after I copied it to a DGX-1 with V100s, it is very slow (about 3x slower than on the TITAN Xp). By the way, I have set torch.backends.cudnn.benchmark = True.
Hi @antspy, I am encountering the same problem as you. In my case, I found that loss.backward() is 5 times slower on a V100 than on a TITAN Xp. Did you solve the problem? Could you offer any suggestions? Thank you.