Very slow training on RTX 2070 8GB RAM

I am experiencing very slow training on an RTX 2070 on Windows 10, and the GPU is barely utilized, while the exact same code is very fast on a GTX 1080Ti. After some research I concluded that training must be running only on the CUDA cores and not on the Tensor Cores. I tried calling .half() on both the model and the inputs, but training is still just as slow. Even after I switched to apex (I used the conda distribution from here: https://anaconda.org/conda-forge/nvidia-apex), training is still slow.
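
For reference, this is roughly what I tried (the model, shapes, and optimizer below are just placeholders, not my actual code):

```python
import torch
import torch.nn as nn

device = torch.device("cuda")

# --- Option 1: plain .half() on both the model and the inputs ---
model = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
model.half()                                      # cast parameters to FP16
x = torch.randn(64, 1024, device=device).half()   # cast the inputs to FP16 as well
out = model(x)

# --- Option 2: apex amp (assumes the conda-forge apex build is importable) ---
from apex import amp

model32 = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model32.parameters(), lr=1e-3)
model32, optimizer = amp.initialize(model32, optimizer, opt_level="O1")

loss = model32(torch.randn(64, 1024, device=device)).sum()
with amp.scale_loss(loss, optimizer) as scaled_loss:
    scaled_loss.backward()
optimizer.step()
```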

I am kind of out of options now.
Is there an example script using .half() or apex that I can use to check whether the GPU can be fully utilized?
Is there a way to monitor whether Tensor Cores are being used on Windows?

Any help is appreciated!

You could use pyprof to check for Tensor Core usage.
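
A rough sketch of how pyprof is typically wired in (the exact import path and profiler commands depend on your pyprof/apex version, so treat this as an outline rather than a recipe):

```python
import torch
import torch.nn as nn
import pyprof  # or `from apex import pyprof` in older apex builds

pyprof.init()  # adds NVTX markers to PyTorch ops so the profiler can attribute kernels

device = torch.device("cuda")
model = nn.Linear(1024, 1024).to(device)     # placeholder model
x = torch.randn(64, 1024, device=device)     # placeholder input batch

with torch.autograd.profiler.emit_nvtx():
    for _ in range(10):                      # profile only a few iterations
        out = model(x)
        out.sum().backward()

# Then run the script under nvprof (or Nsight Systems), e.g.:
#   nvprof -f -o net.sql -- python train.py
# and post-process net.sql with pyprof's parse/prof tools;
# kernels that ran on Tensor Cores should be flagged (TC) in the report.
```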

Are both GPUs in the same machine? If not, is the setup of the machines similar?

Thanks! I will take a look at pyprof. The GTX setup is very different: it is a Linux cluster with 4 GPUs per node. But performance on my local RTX is roughly 3,500x slower than on the cluster.

On the RTX I can see that GPU RAM is being used and the CUDA cores are heavily loaded, but the overall GPU utilization is only around 7%.

I tried experimenting with reducing the batch size and the number of workers, but it does not seem to make a difference.

Could you try to profile the data loading pipeline, e.g. using this code, or alternatively remove the data loading and feed random input to your model?
This would tell you whether the data loading is the bottleneck, which might be the case given the low utilization of the device.
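
For example, something along these lines could give you a rough split between data loading time and GPU time per iteration (the loader, model, criterion, and optimizer arguments are placeholders for your own objects):

```python
import time
import torch

def time_loader(loader, model, criterion, optimizer, device, n_iters=50):
    """Rough split between time spent waiting for data and total step time."""
    data_t = total_t = 0.0
    end = time.perf_counter()
    for i, (x, y) in enumerate(loader):
        data_t += time.perf_counter() - end           # time spent in the DataLoader
        x = x.to(device, non_blocking=True)
        y = y.to(device, non_blocking=True)
        optimizer.zero_grad()
        loss = criterion(model(x), y)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()                      # wait for the GPU before timing
        total_t += time.perf_counter() - end          # data loading + compute
        end = time.perf_counter()
        if i + 1 == n_iters:
            break
    n = i + 1
    print(f"data: {data_t / n:.4f} s/iter, total: {total_t / n:.4f} s/iter")
```

If the data time is close to the total time, the DataLoader is the bottleneck; if it is small, the slowdown is somewhere else.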

I followed your suggestion and fed random inputs to the model, but no luck. Just to make sure I did this correctly: I removed the data loader altogether and just loaded a batch of random tensors, but this did not increase my GPU usage either. I still need to look at pyprof.

I noticed that changing the number of workers from 0 to 16 does not seem to significantly change my CPU usage (I have 8 cores). That seems strange.

For the ImageNet example it seems that I need to download the data first, is that correct? I am still waiting for access approval. Do you by any chance have another example to check both mixed precision and data loading bottlenecks? Any ideas are appreciated! Thanks

I would recommend narrowing down the low-utilization issue first.
If you've removed the data loading and are using a random CUDA tensor, the utilization should be higher, at least if your code is not bottlenecked by anything else.
So try to remove all unnecessary code and just keep the training step (forward, backward, and optimizer), as in the sketch below.
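
As a reference, a minimal self-contained loop along those lines could look like this (layer sizes, batch size, and optimizer are arbitrary placeholders):

```python
import torch
import torch.nn as nn

# Stripped-down loop: no DataLoader, no logging, just forward/backward/step
# on a fixed random batch that already lives on the GPU.
device = torch.device("cuda")
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(256, 1024, device=device)        # random input on the GPU
y = torch.randint(0, 10, (256,), device=device)  # random targets

for _ in range(500):                              # watch GPU utilization while this runs
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    optimizer.step()
torch.cuda.synchronize()
```

If utilization stays low even with this loop, the data pipeline is not the bottleneck.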

Thanks a lot for this tip, I will try that.