OK, I’m rather confused here… I’m running PyTorch on two machines with very similar configurations, both containing 1080 Ti cards. I’m running the exact same code on both, training a standard resnet34 model. Neither is remotely CPU-bound, both have plenty of free memory, and both are reading from fast SSDs. Running the CUDA nbody sample on each gives very similar GPU benchmark results.
Yet on one of them all my training runs 250% slower than on the other! I’m at a loss as to how to begin debugging this. Could it be a cuDNN issue? If so, how would I go about checking that? What other issues could cause it?
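One thing I was planning to try, to take the data loader and disk out of the picture entirely, is a small synthetic micro-benchmark on each machine (a rough sketch, not my actual training code; shapes are arbitrary):

```python
import time
import torch

# Isolate the GPU stack: time forward+backward on random data so the
# data pipeline cannot be the bottleneck.
device = "cuda" if torch.cuda.is_available() else "cpu"
net = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
    torch.nn.Conv2d(16, 16, 3, padding=1),
).to(device)
x = torch.randn(8, 3, 64, 64, device=device)

for _ in range(3):          # warm-up (cuDNN autotuning, memory allocator)
    net(x).sum().backward()
if device == "cuda":
    torch.cuda.synchronize()  # CUDA kernels are async; sync before timing
t0 = time.time()
iters = 10
for _ in range(iters):
    net.zero_grad()
    net(x).sum().backward()
if device == "cuda":
    torch.cuda.synchronize()
print("per-iteration: %.4f s" % ((time.time() - t0) / iters))
```

If the two boxes differ on this too, the problem is in the driver/CUDA/cuDNN stack rather than my input pipeline.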
I have the current conda version of PyTorch installed, and all my conda and pip packages are up to date. I’m using the latest Anaconda Python 3.6. The Nvidia driver version is 381.22.
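For reference, here is the sort of version check I was going to run on each machine to rule out a library mismatch (just the standard `torch.version` / `torch.backends.cudnn` attributes):

```python
import torch

# Compare these between the two machines; any difference is a suspect.
print("torch:", torch.__version__)
print("cuda:", torch.version.cuda)                  # CUDA version PyTorch was built against
print("cudnn:", torch.backends.cudnn.version())     # e.g. 6021; None if built without cuDNN
print("cudnn enabled:", torch.backends.cudnn.enabled)
if torch.cuda.is_available():
    print("gpu:", torch.cuda.get_device_name(0))    # should match on both boxes
```

If the cuDNN versions do match, is there anything else worth diffing between the two installs?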