Why is PyTorch's GPU utilization so low in production (NOT training)?

Transferring data to the CPU, e.g. to print these tensors, will add synchronization to your code, which might slow down the overall epoch duration.
Often (especially for a complete training run) this is negligible.
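To make the synchronization point concrete, here is a minimal sketch (assuming a CUDA device is available and using arbitrary matmul sizes) showing that kernel launches are asynchronous, that printing or calling `.item()` forces a device-to-host copy and therefore blocks, and how to time the GPU work properly with `torch.cuda.synchronize()`:

```python
import time
import torch

device = "cuda"
x = torch.randn(4096, 4096, device=device)

# Kernel launches are asynchronous: this loop returns almost immediately
# because the matmuls are only enqueued on the GPU.
start = time.perf_counter()
for _ in range(10):
    y = x @ x
print(f"enqueue time: {time.perf_counter() - start:.4f}s")

# Printing (or .cpu(), .item(), .numpy()) copies data to the host and
# therefore blocks until all queued kernels have finished.
start = time.perf_counter()
print(y[0, 0].item())
print(f"sync via item(): {time.perf_counter() - start:.4f}s")

# For correct timing, synchronize explicitly instead of relying on a print.
torch.cuda.synchronize()
start = time.perf_counter()
for _ in range(10):
    y = x @ x
torch.cuda.synchronize()
print(f"actual GPU time: {time.perf_counter() - start:.4f}s")
```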


Hi,

I had to set the batch_size of the data loader in your code to 64; otherwise I was not able to get above 57% utilization on the RTX 3080.

I was testing on my 3060 Ti and 3080, and surprisingly the 3060 Ti performs noticeably faster than the 3080. Usually the utilization is capped at 5-20% on my 3080.

My cuDNN version is 8005 and CUDA is 11.1.
Can you let me know what the issue could be?

The result is consistent across multiple runs; a batch size of 64 seems to do the job.

Awaiting your response.

Low GPU utilization could be caused by different bottlenecks in your code, e.g. data loading. A proper way to isolate it would be to profile the code, or to remove the data loading and profile just the actual GPU workload; a sketch of this approach follows below.
You could also try out the nightly binaries, which ship with cuDNN 8.2.2 and could yield another speedup (this won't solve the low utilization if the bottleneck is indeed in another part of the code).
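A minimal sketch of isolating the GPU workload: the model, shapes, and optimizer below are placeholders (assumptions, not your actual setup). The DataLoader is replaced with static random tensors that already live on the GPU, so the profiler output reflects only the compute, not data loading or host-to-device transfers:

```python
import torch
import torch.nn as nn
from torch.profiler import profile, ProfilerActivity

# Hypothetical model; swap in your own.
model = nn.Sequential(
    nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 10),
).cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Static random inputs on the GPU stand in for the DataLoader, so the
# measurement covers only the actual GPU workload.
data = torch.randn(64, 3, 224, 224, device="cuda")
target = torch.randint(0, 10, (64,), device="cuda")

with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
    for _ in range(10):
        optimizer.zero_grad()
        loss = criterion(model(data), target)
        loss.backward()
        optimizer.step()

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```

If the GPU is well utilized in this isolated run but not in the full pipeline, the bottleneck is most likely in data loading or preprocessing rather than in the model itself.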
