PyTorch runs 250% slower on same GPU in different machine

OK, I’m rather confused here… I’m running PyTorch on two machines that are very similar in configuration, and both contain 1080 Ti cards. I’m running the exact same code on both, training a standard resnet34 model. Neither is remotely CPU-bound, both have loads of free memory, and both are reading from fast SSDs. Running the CUDA n-body sample on each shows very similar GPU benchmark results.

Yet on one of them all my training runs 250% slower than on the other! I’m at a loss for how to begin debugging this. Could it be a cuDNN issue? If so, how should I go about checking that? What other issues could cause it?
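For reference, here’s the kind of environment check I’m planning to run on both machines to compare the cuDNN side (a sketch; these are all standard `torch` attributes as far as I know, but exact availability may vary by version):

```python
import torch

def gpu_env_report():
    """Collect the version info that most often explains speed gaps between machines."""
    info = {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,                # CUDA version PyTorch was built against
        "cudnn": torch.backends.cudnn.version(),   # None means no cuDNN -> big slowdown
        "cudnn_enabled": torch.backends.cudnn.enabled,
    }
    if torch.cuda.is_available():
        info["device"] = torch.cuda.get_device_name(0)
    return info

if __name__ == "__main__":
    for key, value in gpu_env_report().items():
        print(f"{key}: {value}")
```

If `cudnn` comes back as `None` (or a much older version) on the slow box, that alone could account for a 2–3x gap on convnets.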

I have the current conda version of PyTorch installed, and all my conda and pip modules are up to date. I’m using the latest Anaconda Python 3.6. The NVIDIA driver version is 381.22.
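To rule out the data pipeline entirely, I’m also thinking of timing a pure conv workload identically on both machines. A minimal sketch (layer sizes and iteration counts are arbitrary choices of mine, not from my real training code):

```python
import time
import torch
import torch.nn as nn

def time_conv_step(device=None, iters=10):
    """Return average seconds per forward+backward of a small conv stack."""
    device = device or ("cuda" if torch.cuda.is_available() else "cpu")
    model = nn.Sequential(
        nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
        nn.Conv2d(64, 64, 3, padding=1), nn.ReLU(),
    ).to(device)
    x = torch.randn(8, 3, 64, 64, device=device)

    # Warm-up so lazy CUDA init / cuDNN autotuning doesn't pollute the timing.
    for _ in range(3):
        model(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()

    start = time.perf_counter()
    for _ in range(iters):
        model(x).sum().backward()
    if device == "cuda":
        torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
    return (time.perf_counter() - start) / iters

if __name__ == "__main__":
    print(f"{time_conv_step():.4f} s/iter")
```

If this number differs by ~2.5x between the two boxes, the problem is in the framework/driver stack rather than in my dataloading.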


I’ve seen something similar, and I’m curious what reasons come up. I put it down to a difference in CPUs, but that was for an RNN; for a convnet, that’d be less convincing…

Same GPU and same resnet50, yet libtorch is 3 times slower than Caffe at loading the model and predicting. Why?