A computer with 4 GPUs has different error

yelanyelan · November 5, 2021, 7:36am

Anyway, thank you very much for your answer. Now my situation is a bit complicated.
On device 0,first, I was trying to fix the indexing issue with CUDA_LAUNCH_BLOCKING=1. I found it might come from torch.gather, so I tried to cancel the use of it. It worked and device 0 can training with ‘CUDA_LAUNCH_BLOCKING=1’. But after training for a while, device 0 disappeared from nvidia-smi. I used sudo nvidia-persistenced --persistence-mode to fix this issue.

By the way, I have a custom loss, maybe its gradient will be relatively large，we temporarily call it 'A loss‘’. Continue training, I found that device 0 has the following situations：

It can training, and anything a simple network: for example, a network with only one convolutional layer and using mseloss. (Whether using MSE loss or Aloss, whether using CUDA_LAUNCH_BLOCKING=1 or not)
When using a Complex network(with resnet34 backbone and jumper connection, decoding network):

It will report cuDNN error: CUDNN_STATUS_INTERNAL_ERROR when running without using CUDA_LAUNCH_BLOCKING=1.
When running with CUDA_LAUNCH_BLOCKING=1, it can be trained not to ask any questions. But its loss is easy to become NAN when using Aloss.

I make a few guesses：

On device 0, it seems to be more sensitive to large values than others. So it is more likely to cause the value to overflow and cause loss to become NAN.
This problem may not come from pytorch but the GPU itself. Because my code does not have any problems mentioned in the above device 0 on device 1, 3.
But I cannot understand CUDA_LAUNCH_BLOCKING=1 cannot solve the problem of cuDNN error: CUDNN_STATUS_INTERNAL_ERROR

As for device 2, when I try to use A to solve it, it will arise Unable to find a valid cuDNN algorithm to run convolution. I carefully referred to the questions you answered:
Unable to find a valid cuDNN algorithm to run convolution
And it doesnt worked.

In the end, I found that the above two devices 0 and 2 that have problems are both when trying to train:
Their Fan and Pwr Usage will become ERR.
QQ截图20211105153021

I am very grateful to your help. If you have any relevant information please help me.