High GPU temperature issue with PyTorch

Hello,

I have PyTorch code written for semantic segmentation. Technically the code runs fine without any errors, but I am facing a high GPU temperature issue. With the same code written in TensorFlow, my GPU only reaches 65 °C, but in the PyTorch case my GPU reaches 85 °C. Here are all the details of my GPU and training configuration:

Hardware:

GPU: NVIDIA GeForce RTX 2060 Super - 8GB

RAM: 16 GB

Software:

PyTorch v1.6

CUDA 10.2

Training Setting:

Img_size = (256, 256)

batch_size = 4

num_epochs = 100

lr = 1e-4

num_workers=2
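
For context, here is a minimal sketch of how these settings plug into a DataLoader and optimizer (the dataset and model below are dummy placeholders, not my actual segmentation code):

import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Dummy data and model, only to show where the posted settings go;
# the real code uses the actual segmentation dataset and network.
images = torch.randn(8, 3, 256, 256)                 # Img_size = (256, 256)
masks = torch.randint(0, 2, (8, 256, 256))
train_loader = DataLoader(TensorDataset(images, masks),
                          batch_size=4, shuffle=True, num_workers=2)

model = nn.Conv2d(3, 2, kernel_size=3, padding=1).cuda()  # stand-in for the real model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)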


Hello,

The temperature of the GPU is mainly related to GPU utilization. Can you check whether both codes give you the same GPU-Util and the same memory usage in nvidia-smi?

If you see a difference in GPU-Util, it may be related to how you read and batch the dataset, so you could first check the amount of time needed to read each batch in your TF code and your Torch code.
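
As a rough sanity check on the loading side, you could time a few batches like this (train_loader here stands for whatever DataLoader your script already builds; the same idea applies to the tf.data iterator):

import time

# Rough per-batch timing; train_loader stands for whatever DataLoader
# the training script already builds.
n_batches = 50
start = time.perf_counter()
for i, batch in enumerate(train_loader):
    if i + 1 == n_batches:
        break
elapsed = time.perf_counter() - start
print(f"avg time per batch: {elapsed / n_batches:.4f} s")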

Please check the following screenshot details and let me know why PyTorch is producing such a high GPU temperature, and what the solution is.

PyTorch nvidia-smi screenshot:

TensorFlow nvidia-smi screenshot:

Hello,

It seems that Torch is making use of the entire GPU while consuming less memory, which I would say is good. I think your TF code is not as well optimized; the likely reason is that it wastes time transferring data from RAM to the GPU.

The heat level you have is not really a problem, I would say; normally you want to use the GPU fully so your training runs faster. So the fact that GPU-Util is 99% is actually a good sign.
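
For what it's worth, the usual PyTorch pattern for reducing the cost of host-to-GPU copies is pinned memory plus asynchronous transfers; a sketch, with train_dataset as a placeholder for your own dataset:

from torch.utils.data import DataLoader

# Typical way to reduce host-to-GPU copy overhead in PyTorch:
# pinned host memory plus asynchronous copies.
loader = DataLoader(train_dataset, batch_size=4, shuffle=True,
                    num_workers=2, pin_memory=True)

for images, masks in loader:
    images = images.cuda(non_blocking=True)   # async copy from pinned memory
    masks = masks.cuda(non_blocking=True)
    # ... forward/backward pass as usual ...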

Thanks for the reply!

Please check: in the GPU software here I set the target temperature to 70 °C, and now the GPU temperature stays under 70 °C. However, the power also dropped from 100% to 79%. Are these settings OK for me? For reference, the factory settings were a target temperature of 83 °C and 100% power.

As @omarfoq explained, you usually want to fully utilize the GPU to get the best performance.
If your system hits a thermal issue, you might either want to fix it with better cooling or artificially throttle the GPU, e.g. by limiting the power usage (which would also slow down your code and is not the optimal solution).
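
If you want to keep an eye on the temperature and power draw while training, a small helper around nvidia-smi's standard query flags can log them from Python (the printed line in the comment is only illustrative):

import subprocess

# Query temperature, utilization and power draw via nvidia-smi's standard
# --query-gpu flags; handy to watch while the training script runs.
def gpu_status():
    out = subprocess.check_output([
        "nvidia-smi",
        "--query-gpu=temperature.gpu,utilization.gpu,power.draw",
        "--format=csv,noheader",
    ])
    return out.decode().strip()

print(gpu_status())  # prints something like "80, 99 %, 160.00 W" (values illustrative)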

Thanks for the reply.

Currently, I have the following configuration:

Train Images: 1800 - Valid: 450
batch_size = 12
num_workers = 0

My average GPU temperature stays around 80 °C. Is that normal, or is there something else I could try here?

Hello @ahmediqbal, I think 80 °C is not a big problem, since the RTX 2060 Super has a maximum thermal temperature of 89 °C. Running at 80 °C won't harm the card and won't hurt performance.

@omarfoq @ptrblck I fixed the high GPU temperature issue by switching the fan speed from Auto to a manual 75%. That cools the GPU down, and my GPU temperature dropped from 80 °C to 60 °C.
One more thing: what is the role of num_workers, and is there a formula for selecting the required num_workers?

Hi,

Ah, that's nice, I wasn't aware that you can set the fan speed manually. Regarding the number of workers: if you mean the num_workers argument used when creating the DataLoader, it simply controls how many subprocesses are used for data loading (see here). There is no formula for choosing it, to my knowledge. Usually the best value depends on your CPU (how many cores and how many threads per core). A typical choice is 8, because many personal computers have 8 CPU cores and one worker thread per core, but with a different CPU architecture other values may work better. Note also that the number of workers is not capped; you can choose any value you want, in which case the extra subprocesses will simply wait in a pool.
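
As a rough starting point (a heuristic rather than a formula), you could derive num_workers from the CPU core count and then tune it by measuring the loading time; train_dataset below is a placeholder for your own dataset:

import os
from torch.utils.data import DataLoader

# A common heuristic rather than a formula: start from the CPU core count
# (capped at a small number) and tune based on measured loading time.
num_workers = min(8, os.cpu_count() or 1)

loader = DataLoader(train_dataset, batch_size=12,
                    num_workers=num_workers, pin_memory=True)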