RTX 3060 Ti is slow in PyTorch

I am running the same code on my personal PC with an RTX 3060 Ti and on a computing cluster with an older Tesla P100 12 GB. On both systems, I installed the latest versions of PyTorch and the CUDA toolkit with conda.

Could you help me understand why training on my PC is significantly slower, even though my GPU should be the stronger one?


The raw compute of the GPU is not the only factor; memory bus width and bandwidth may also play a role. To verify, can you time a simple operation on each GPU, excluding data-transfer time?
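A minimal sketch of such a timing, assuming a square matmul as the "simple operation". CUDA kernels launch asynchronously, so `torch.cuda.synchronize()` is needed before reading the clock; the tensors are created directly on the device so host-to-device transfer is excluded:

```python
import time
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Create inputs directly on the device so no H2D copy is timed.
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)

# Warm-up: the first CUDA calls pay one-time init costs.
for _ in range(3):
    torch.mm(a, b)
if device == "cuda":
    torch.cuda.synchronize()

start = time.perf_counter()
for _ in range(10):
    torch.mm(a, b)
if device == "cuda":
    torch.cuda.synchronize()  # wait for the async kernels to finish
elapsed = time.perf_counter() - start
print(f"{device}: {elapsed / 10 * 1000:.3f} ms per 1024x1024 matmul")
```

Running this on both machines gives a compute-only comparison; if the 3060 Ti wins here, the slowdown in training likely comes from the input pipeline or data transfer rather than the GPU itself.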

The two GPUs have different architectures. That said, the RTX 30 series is newer and outperforms the Tesla P100 in raw TFLOPS. I would suggest tuning the DataLoader arguments, namely `pin_memory=True` and `num_workers` (vary according to your system specifications). For instance, if you have a 20-thread processor, do not use all of the threads to load data to the GPU; try to find the sweet spot for `num_workers` on your system.
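A sketch of the suggested DataLoader settings, using a hypothetical toy `TensorDataset` as a stand-in for your data:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy dataset; substitute your own Dataset here.
dataset = TensorDataset(
    torch.randn(1000, 3, 32, 32),
    torch.randint(0, 10, (1000,)),
)

loader = DataLoader(
    dataset,
    batch_size=64,
    shuffle=True,
    num_workers=2,    # start well below your core count and tune upward
    pin_memory=True,  # page-locked host memory speeds up host-to-GPU copies
)

for images, labels in loader:
    # With pin_memory=True, non_blocking=True lets the copy overlap compute:
    # images = images.cuda(non_blocking=True)
    break
```

The right `num_workers` is workload-dependent: too few and the GPU starves waiting for batches, too many and worker processes contend for CPU and memory, so benchmarking a few values on your own machine is the reliable approach.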

If you installed the conda binaries or pip wheels, you might be hitting this issue.
A workaround would be to build PyTorch from source against the latest cuDNN version.
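To check which versions your current install actually ships with before rebuilding, you can query PyTorch directly (on a CPU-only build, the CUDA and cuDNN queries return None):

```python
import torch

print("PyTorch:", torch.__version__)
print("CUDA:", torch.version.cuda)                 # CUDA version the wheel was built with
print("cuDNN:", torch.backends.cudnn.version())    # bundled cuDNN, e.g. 8902
```

Comparing this output on the two machines can confirm whether both environments really picked up the same PyTorch/CUDA/cuDNN combination.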