My computer has 4 GPUs, all of them NVIDIA GeForce RTX 2080 Ti cards. My environment is:
- NVIDIA-SMI: 470.42.01
- Driver Version: 470.42.01
- CUDA Version: 11.4.20210623
- cuDNN Version: 8.2.4.15-1+cuda11.4
- PyTorch Version: 1.7.0
The same code runs fine with PyTorch on GPUs 1 and 3, but it fails with different errors on GPUs 0 and 2.
When using `CUDA_VISIBLE_DEVICES=0`: I usually connect the screen to GPU 0 for the display, and I always close the graphical interface while a job is running. It reported:
```
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [43,0,0], thread: [32,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [43,0,0], thread: [33,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [43,0,0], thread: [34,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:115: operator(): block: [43,0,0], thread: [35,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
...
Traceback (most recent call last):
  ...
  File "/home/Project/test.py", line 47, in get_rate_eachc
    _, inverse, counts = torch.unique(CH_area[i], return_inverse=True, return_counts=True)
  File "/home/wenqu/workspace/pytorch/lib/python3.6/site-packages/torch/_jit_internal.py", line 265, in fn
    return if_true(*args, **kwargs)
  File "/home/wenqu/workspace/pytorch/lib/python3.6/site-packages/torch/_jit_internal.py", line 265, in fn
    return if_true(*args, **kwargs)
  File "/home/wenqu/workspace/pytorch/lib/python3.6/site-packages/torch/functional.py", line 682, in _unique_impl
    return_counts=return_counts,
RuntimeError: transform: failed to synchronize: cudaErrorAssert: device-side assert triggered
```
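As far as I understand, device-side asserts are reported asynchronously, so the `torch.unique` call in the traceback may not be the operation that actually produced the bad index. I plan to rerun with `CUDA_LAUNCH_BLOCKING=1` to find the real call site. For reference, here is a minimal sketch (with a deliberately out-of-bounds index that I made up, not my real data) of the kind of bug that trips this exact assertion:

```python
import os
# Must be set before CUDA initializes so kernels launch synchronously
# and the assert surfaces at the real call site instead of a later sync.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

src = torch.randn(4, device="cuda")
idx = torch.tensor([0, 1, 5], device="cuda")  # 5 is out of bounds for a size-4 tensor

try:
    torch.gather(src, 0, idx)  # trips `idx_dim >= 0 && idx_dim < index_size`
    torch.cuda.synchronize()
except RuntimeError as e:
    print(e)  # cudaErrorAssert: device-side assert triggered

# On the CPU the same call raises a precise IndexError instead,
# which is another way to locate the offending value:
#   torch.gather(src.cpu(), 0, idx.cpu())
```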
When using `CUDA_VISIBLE_DEVICES=2`:
```
Traceback (most recent call last):
  ...
  File "/home/workspace/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/Project/test2.py", line 44, in forward
    theta = self.tanh(self.conv0(x))
  File "/home/workspace/pytorch/lib/python3.6/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/workspace/pytorch/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 423, in forward
    return self._conv_forward(input, self.weight)
  File "/home/workspace/pytorch/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 420, in _conv_forward
    self.padding, self.dilation, self.groups)
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
```
You can try to repro this exception using the following code snippet. If that doesn't trigger the error, please include your original repro script when reporting this issue.

```python
import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([4, 3, 224, 224], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(3, 128, kernel_size=[3, 3], padding=[1, 1], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
```

```
ConvolutionParams
    data_type = CUDNN_DATA_FLOAT
    padding = [1, 1, 0]
    stride = [1, 1, 0]
    dilation = [1, 1, 0]
    groups = 1
    deterministic = false
    allow_tf32 = true
input: TensorDescriptor 0x499d450
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 4, 3, 224, 224,
    strideA = 150528, 50176, 224, 1,
output: TensorDescriptor 0x77c4b480
    type = CUDNN_DATA_FLOAT
    nbDims = 4
    dimA = 4, 128, 224, 224,
    strideA = 6422528, 50176, 224, 1,
weight: FilterDescriptor 0x7c448fe0
    type = CUDNN_DATA_FLOAT
    tensor_format = CUDNN_TENSOR_NCHW
    nbDims = 4
    dimA = 128, 3, 3, 3,
Pointer addresses:
    input: 0x7fc1ba000000
    output: 0x7fc1b2000000
    weight: 0x7fc2211b9000
```
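From what I have read, `CUDNN_STATUS_INTERNAL_ERROR` is often a secondary symptom of something else, for example the device running low on memory, or the algorithm search that `torch.backends.cudnn.benchmark = True` performs. As a sanity check I want to retry the generated repro with benchmarking off; this is just a sketch along those lines, not a confirmed fix:

```python
import torch

# Assumption: the failure may come from cuDNN's algorithm search rather than
# the convolution itself, so disable benchmarking and retry the same conv.
torch.backends.cudnn.benchmark = False

dev = torch.device("cuda:0")  # maps to physical GPU 2 under CUDA_VISIBLE_DEVICES=2
x = torch.randn(4, 3, 224, 224, device=dev)
conv = torch.nn.Conv2d(3, 128, kernel_size=3, padding=1).to(dev)

out = conv(x)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()
print("conv ok, peak memory: %.0f MiB" % (torch.cuda.max_memory_allocated(dev) / 2**20))
```

If it still fails, setting `torch.backends.cudnn.enabled = False` would at least tell me whether only the cuDNN path is affected.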
When using `CUDA_VISIBLE_DEVICES=0,1,2`: it reported the same error as GPU 0.
These errors do not occur on GPUs 1 and 3, so I can't tell where the problem is: whether my code is wrong, or those two GPUs have a hardware or driver problem. Does anyone have the same problem?
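To separate the two possibilities, I put together the small per-device check below. The shapes and calls are made up to mirror the two failing paths above, not my actual code; if it fails only on `cuda:0` and `cuda:2`, that would point at the devices rather than my scripts:

```python
import torch

for i in range(torch.cuda.device_count()):
    dev = torch.device("cuda:%d" % i)
    try:
        # Exercise the cuDNN convolution path from the second error...
        x = torch.randn(4, 3, 224, 224, device=dev)
        conv = torch.nn.Conv2d(3, 128, kernel_size=3, padding=1).to(dev)
        conv(x).sum().backward()

        # ...and the unique/scatter-gather path from the first error.
        t = torch.randint(0, 10, (1000,), device=dev)
        torch.unique(t, return_inverse=True, return_counts=True)

        torch.cuda.synchronize(dev)
        print("%s (%s): OK" % (dev, torch.cuda.get_device_name(i)))
    except RuntimeError as e:
        print("%s: FAILED -> %s" % (dev, e))
```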
Thanks for any help.