I have a RetinaNet model (ResNet-101 backbone) that I am trying to train on a custom dataset of around 2500 images in total. The problem is that it takes approximately 30 minutes per epoch. Is it supposed to be this slow?
I have PyTorch 1.7 and CUDA 11.1 installed on the system, with cuDNN 8.0.5.
System specs: i7 9750 with an RTX 2070. All hardware benchmarks run fine. Is the slow training due to a CUDA/cuDNN version mismatch, and will it go away if I build PyTorch from source?
I don’t know what you mean by “CUDA and cuDNN mismatch”, but you could try the CUDA 10.2 binaries and see if you get similar performance.
Also note that your training might suffer from other bottlenecks, such as data loading, which should be visible as low GPU utilization in e.g. nvidia-smi.
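If data loading turns out to be the bottleneck, a common mitigation is to load batches in background worker processes. A minimal sketch, using a dummy `TensorDataset` as a stand-in for the custom detection dataset (an assumption; the real dataset and batch size will differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the custom dataset (assumption).
images = torch.randn(100, 3, 64, 64)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(images, labels)

# num_workers > 0 loads batches in background processes so the GPU
# is not left waiting; pin_memory speeds up host-to-device copies.
# The best num_workers value is machine-dependent.
loader = DataLoader(dataset, batch_size=8, num_workers=2, pin_memory=True)

for batch_images, batch_labels in loader:
    pass  # training step would go here
```

Comparing epoch time with `num_workers=0` versus a few workers is a quick way to see whether loading is the limiting factor.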
Hi, thanks for replying.
By mismatch I meant that the PyTorch binaries ship with CUDA 11.0 whereas I have 11.1 installed, and 1.7 uses cuDNN 8.0.3 whereas I have 8.0.5. I was wondering whether that’s the reason for the slowdown.
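You can check which CUDA and cuDNN versions PyTorch is actually using (as opposed to what is installed system-wide) directly from Python:

```python
import torch

# These report the versions bundled with (or linked by) the PyTorch build,
# regardless of the system-wide CUDA/cuDNN installation.
print(torch.__version__)                # PyTorch version, e.g. 1.7.x
print(torch.version.cuda)               # CUDA version used by this build
print(torch.backends.cudnn.version())   # cuDNN version as an int, e.g. 8003
```

If these match the binary release notes rather than the locally installed toolkit, the local 11.1 / 8.0.5 installation is not being used at all.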
During training, nvidia-smi shows GPU utilization at around 90%.
Your locally installed CUDA and cuDNN versions won’t be used if you install the binaries, so you might indeed want to build from source and see if the performance improves.
Oh okay, I will try that out.