I have a RetinaNet model (ResNet-101 backbone) that I am trying to train on a custom dataset of around 2500 images in total. The problem is that it takes approximately 30 minutes per epoch. Is it supposed to be this slow?
I have PyTorch 1.7 and CUDA 11.1 installed on the system, with cuDNN 8.0.5.
System specs: i7 9750 with an RTX 2070. All hardware benchmarks run fine. Is the slow training due to a CUDA/cuDNN version mismatch, and will it go away if I build PyTorch from source?
I don’t know what you mean by “CUDA and cuDNN mismatch”, but you could try the CUDA 10.2 binaries and see if you get similar performance.
Also note that your training might suffer from other bottlenecks, such as data loading, which should be visible as low GPU utilization in e.g. nvidia-smi.
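If data loading turns out to be the bottleneck, a common mitigation is to load batches in background worker processes. A minimal sketch, using a dummy `TensorDataset` as a stand-in for the custom detection dataset (an assumption; the real dataset and batch size will differ):

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Dummy data standing in for the custom dataset (assumption).
images = torch.randn(100, 3, 64, 64)
labels = torch.randint(0, 2, (100,))
dataset = TensorDataset(images, labels)

# num_workers > 0 loads batches in background processes so the GPU
# is not left waiting; pin_memory speeds up host-to-device copies.
# The best num_workers value is machine-dependent.
loader = DataLoader(dataset, batch_size=8, num_workers=2, pin_memory=True)

for batch_images, batch_labels in loader:
    pass  # training step would go here
```

Comparing epoch time with `num_workers=0` versus a few workers is a quick way to see whether loading is the limiting factor.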
Hi, thanks for replying.
By mismatch I meant that the PyTorch binaries ship with CUDA 11.0 whereas I have 11.1 installed, and 1.7 uses cuDNN 8.0.3 whereas I have 8.0.5. I was wondering whether that’s the reason for the slowdown.
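You can check which CUDA and cuDNN versions PyTorch is actually using (as opposed to what is installed system-wide) directly from Python:

```python
import torch

# These report the versions bundled with (or linked by) the PyTorch build,
# regardless of the system-wide CUDA/cuDNN installation.
print(torch.__version__)                # PyTorch version, e.g. 1.7.x
print(torch.version.cuda)               # CUDA version used by this build
print(torch.backends.cudnn.version())   # cuDNN version as an int, e.g. 8003
```

If these match the binary release notes rather than the locally installed toolkit, the local 11.1 / 8.0.5 installation is not being used at all.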
During training, nvidia-smi shows GPU utilization at around 90%.
Your locally installed CUDA and cuDNN versions won’t be used if you install the binaries, so you might indeed want to build from source and see if the performance improves.
Oh okay, I will try that out.