nn.Module.to super slow on RTX 3060

I’m running inference with a machine learning model, using pre-trained weights. I had been running it on my laptop with a GeForce 940MX, so I was excited to try it out on my new desktop with an RTX 3060. To my surprise, it takes ~30 minutes to load the model. After that it runs quickly, but the results are bad (I’m crossing my fingers that solving this issue will also solve that). I am not familiar with the model architecture, since I didn’t build it, but I finally tracked the issue down to the line “netG.to(device)”, where netG inherits from nn.Module.

Strangely, this only happens on my desktop, not on my laptop. I went so far as to move everything over to an external drive, including the conda virtual environment, so I could be sure I was running the exact same code on the same input with the same setup. It is still fast with good results on the laptop, but slow with poor results on the desktop.

What could be causing this issue?

This smells like the CUDA JIT compiler is kicking in.
Which PyTorch binaries are you using?
If you are not using a CUDA 11.x runtime, please update PyTorch to the latest version with CUDA 11.6+.
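
In case it helps to answer the question about binaries, here is a minimal sketch that collects the relevant details in one place. The function name describe_cuda_setup is made up for this example, and the getattr guard is there only because I am assuming torch.cuda.get_arch_list might be missing in an older release.

```python
def describe_cuda_setup():
    """Collect the version details that matter for this kind of slowdown."""
    try:
        import torch
    except ImportError:
        # PyTorch is not installed in this environment.
        return {"torch_version": None}

    info = {
        "torch_version": torch.__version__,
        # CUDA runtime the binaries were built with (None for CPU-only builds).
        "cuda_runtime": torch.version.cuda,
        "cuda_available": torch.cuda.is_available(),
    }
    if info["cuda_available"]:
        info["device_name"] = torch.cuda.get_device_name(0)
        # (major, minor) tuple describing the GPU architecture.
        info["compute_capability"] = torch.cuda.get_device_capability(0)
        # Architectures the installed binaries were built for, if the call exists.
        get_arch_list = getattr(torch.cuda, "get_arch_list", None)
        info["arch_list"] = get_arch_list() if get_arch_list else None
    return info


if __name__ == "__main__":
    for key, value in describe_cuda_setup().items():
        print(f"{key}: {value}")
```

Posting this output from both machines should show whether the GPU's compute capability is covered by the architectures in the installed build.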

Thanks for the quick reply ptrblck,

I’m using PyTorch 1.8.0 (py3.6_cuda10.2_cudnn7_0) with cudatoolkit 10.2.89 (h74a9793_1). I haven’t yet been able to get the model to run with the latest PyTorch/CUDA, since something in the chain of necessary dependency updates breaks it.

Any idea why it would run fine on my Windows 11 940MX laptop, but not on my Windows 10 RTX 3060 desktop? Like I said, I’m using the exact same environment on an external disk, so that seems to rule out everything but OS and hardware/drivers. I’m leaning toward the latter, but don’t know much about drivers. Would something about the RTX 3060 cause the CUDA JIT to run when it doesn’t for the 940MX?

Your RTX 3060 is an Ampere generation GPU with compute capability 8.6 and thus needs CUDA 11.x.
The older 940MX should be a Maxwell-generation GPU with compute capability 5.0, which works with older CUDA runtimes (10.2 in this case).
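
To make the compatibility rule concrete, here is a rough sketch of the check, assuming architecture strings of the form "sm_75" (compiled binary) and "compute_75" (embedded PTX) with plain numeric suffixes. The helper kernel_support and both build lists are hypothetical and only illustrate the logic; they are not the actual contents of any PyTorch release.

```python
def kernel_support(capability, arch_list):
    """Classify how a build's kernels can run on a device.

    capability: (major, minor) tuple, e.g. (8, 6) for an RTX 3060.
    arch_list:  strings such as "sm_70" (compiled binary) or
                "compute_70" (embedded PTX).
    Returns "native", "jit", or "unsupported".
    """
    def parse(entry):
        kind, _, digits = entry.partition("_")
        # "86" -> (8, 6); the last digit is the minor version.
        return kind, (int(digits[:-1]), int(digits[-1]))

    capability = tuple(capability)
    parsed = [parse(entry) for entry in arch_list]

    # A compiled binary runs on the same major version with an equal or higher minor.
    for kind, arch in parsed:
        if kind == "sm" and arch[0] == capability[0] and arch[1] <= capability[1]:
            return "native"

    # PTX can be JIT-compiled by the driver for an equal or newer architecture.
    for kind, arch in parsed:
        if kind == "compute" and arch <= capability:
            return "jit"

    return "unsupported"


# Hypothetical arch lists, for illustration only.
old_build = ["sm_50", "sm_60", "sm_70", "sm_75", "compute_75"]
new_build = ["sm_50", "sm_60", "sm_70", "sm_75", "sm_80", "sm_86"]

print(kernel_support((5, 0), old_build))  # Maxwell GPU with the older build
print(kernel_support((8, 6), old_build))  # Ampere GPU with the older build
print(kernel_support((8, 6), new_build))  # Ampere GPU with a newer build
```

With these made-up lists, the Maxwell GPU finds a matching binary in the older build, while the Ampere GPU can only fall back to JIT compilation from PTX until a build that includes 8.x binaries is installed.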

This explains the slow startup time: the CUDA 10.2 binaries contain no kernels compiled for your GPU's architecture, so CUDA is trying to JIT compile the framework's kernels.
Uninstall this old PyTorch version and update to the latest one with a CUDA 11 runtime.
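
After updating, one way to confirm the fix is to time the first transfer to the GPU. This is only a sketch: the helper names are made up, and torch.nn.Linear stands in for netG, since the real model is not shown in this thread. The first call also includes normal CUDA context creation, so a few seconds is expected; minutes would suggest JIT compilation is still happening.

```python
import time


def timed(fn):
    """Run fn() once and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def time_module_transfer():
    """Time two consecutive .to(device) calls; returns None without a usable GPU."""
    try:
        import torch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None

    device = torch.device("cuda")
    model = torch.nn.Linear(8, 8)  # stand-in for netG (assumption)

    def transfer():
        moved = model.to(device)
        # Run one small forward pass so at least one kernel is actually launched.
        moved(torch.randn(1, 8, device=device))
        # Wait for the GPU to finish so the timing is meaningful.
        torch.cuda.synchronize()
        return moved

    _, first = timed(transfer)
    _, second = timed(transfer)
    return first, second


if __name__ == "__main__":
    timings = time_module_transfer()
    if timings is None:
        print("No usable CUDA setup found.")
    else:
        print(f"first call: {timings[0]:.2f} s, second call: {timings[1]:.2f} s")
```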