cuDNN errors when training on HPC cluster - A100/A40

Hi,

My script was training fine until yesterday. It still runs fine on a single A100 PC, but when I submit the batch job through LSF I now get strange cuDNN errors:

Traceback (most recent call last):
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1871, in <module>
train_FocusNet(del_log_dirs=args.del_log_dirs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1792, in train_FocusNet
total_loss.backward() # compute gradients
File "/usr/scratch4/samo4615/miniconda3/envs/av2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/usr/scratch4/samo4615/miniconda3/envs/av2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_INTERNAL_ERROR
You can try to repro this exception using the following code snippet. If that doesn’t trigger the error, please include your original repro script when reporting this issue.

import torch
torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.benchmark = True
torch.backends.cudnn.deterministic = False
torch.backends.cudnn.allow_tf32 = True
data = torch.randn([2, 162, 448, 448], dtype=torch.float, device='cuda', requires_grad=True)
net = torch.nn.Conv2d(162, 2, kernel_size=[1, 1], padding=[0, 0], stride=[1, 1], dilation=[1, 1], groups=1)
net = net.cuda().float()
out = net(data)
out.backward(torch.randn_like(out))
torch.cuda.synchronize()

ConvolutionParams
memory_format = Contiguous
data_type = CUDNN_DATA_FLOAT
padding = [0, 0, 0]
stride = [1, 1, 0]
dilation = [1, 1, 0]
groups = 1
deterministic = false
allow_tf32 = true
input: TensorDescriptor 0x2b907d2f9df0
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 2, 162, 448, 448,
strideA = 32514048, 200704, 448, 1,
output: TensorDescriptor 0x2b907d2d5100
type = CUDNN_DATA_FLOAT
nbDims = 4
dimA = 2, 2, 448, 448,
strideA = 401408, 200704, 448, 1,
weight: FilterDescriptor 0x2b8f20016340
type = CUDNN_DATA_FLOAT
tensor_format = CUDNN_TENSOR_NCHW
nbDims = 4
dimA = 2, 162, 1, 1,
Pointer addresses:
input: 0x2b90b0000000
output: 0x2b8f66360000
weight: 0x2b8dd95fc600

=================================================

Traceback (most recent call last):
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1871, in <module>
train_FocusNet(del_log_dirs=args.del_log_dirs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1792, in train_FocusNet
total_loss.backward() # compute gradients
File "/usr/scratch4/samo4615/miniconda3/envs/av2/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
torch.autograd.backward(
File "/usr/scratch4/samo4615/miniconda3/envs/av2/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Please help.

Best Regards
Sambit

Could you explain how you've installed PyTorch and which version you are using?
Also, did anything change in your environment, and are you (perhaps accidentally) preloading cuDNN from your system libraries?
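For reference, a quick way to see which CUDA/cuDNN build your environment actually picks up (just standard torch version queries, nothing specific to your setup) would be something like:

import torch

# print the versions PyTorch actually loads in this environment
print(torch.__version__)                 # PyTorch build
print(torch.version.cuda)                # CUDA version the binaries ship with
print(torch.backends.cudnn.version())    # cuDNN version that gets loaded
print(torch.cuda.get_device_name(0))     # GPU visible on the node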

I use miniconda.
Then I install PyTorch using pip, since the conda install takes a very long time (I waited about 10 minutes and then gave up).

I am not sure what you mean by preloading cuDNN - I do not do anything special, I just install PyTorch and start training.

I have tried PyTorch 1.12.1 with CUDA 11.3/11.6 and also 1.13.0 with CUDA 11.6.

Best Regards
Sambit

Could you rerun your workload via:

LD_DEBUG=files python script.py args 2>&1 | tee out01
cat out01 | grep libcudnn

and post the output here, please?

Sure - in the batch job script it should look like:

conda activate python39
export LD_DEBUG=files
python <path_to_my_trainer> 2>&1 | tee out01

Then after the error:
cat out01 | grep libcudnn

Is that correct?

Best Regards
Sambit

I would not export this environment variable, as it will make all of your outputs very verbose. Instead, just prepend it to the python command as shown in my example.

Hi,
I solved this issue.
The problem was that I accidentally deleted the line that activates my conda env on the cluster.

This issue can be closed.

Best Regards
Sambit

It’s good to hear you’ve solved the issue, but could you explain which environment was used instead and how it was built or installed?

The HPC cluster runs CentOS and I use a conda environment.
My LSF script looks like:

#!/bin/tcsh
conda activate av2

which python

python train/train_detector_optim.py --del_log_dirs=Y

However, I accidentally deleted the conda activate av2 line and did not notice. As a result, the correct cuDNN libraries were not found.

But hold on, I think this was not the real fix. I just ran into the same problem again, even with the correct conda env activated. I will post an update on this soon.

Now I get a bunch of errors similar to before:

/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [3946,0,0], thread: [28,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [3946,0,0], thread: [29,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [3946,0,0], thread: [30,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [3946,0,0], thread: [31,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [80,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [81,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [82,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [83,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [84,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [85,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [86,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [87,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [88,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [89,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [90,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [91,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [92,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [93,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [94,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [2043,0,0], thread: [95,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [48,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [49,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [50,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [51,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [52,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [53,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [54,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [55,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [56,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [57,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [58,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [59,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [60,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [61,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [62,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1359,0,0], thread: [63,-1,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1905, in <module>
train_FocusNet(del_log_dirs=args.del_log_dirs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1785, in train_FocusNet
mask_hwl = get_hwl_regression_mask_tensor(train_y_hwl)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 116, in get_hwl_regression_mask_tensor
b, ch, r, c = torch.where(train_y_tensor != 0)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

Now I get this:

Traceback (most recent call last):
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1905, in <module>
train_FocusNet(del_log_dirs=args.del_log_dirs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1785, in train_FocusNet
mask_hwl = get_hwl_regression_mask_tensor(train_y_hwl)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 116, in get_hwl_regression_mask_tensor
b, ch, r, c = torch.where(train_y_tensor != 0)
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.

This error points to an invalid indexing operation, so rerun the code with blocking launches (CUDA_LAUNCH_BLOCKING=1), as suggested in the error message, and check which indexing operation fails by inspecting the min. and max. values of the index tensor and the shape of the tensor being indexed.
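A minimal sketch of that kind of check (the tensor names here are made up and not taken from your script):

import torch

# hypothetical example: `index` selects rows of `src` along dim 0
src = torch.randn(10, 10, device="cuda")
index = torch.randint(0, 12, (4,), device="cuda")  # may contain invalid values

print(index.min().item(), index.max().item())  # valid range is [0, src.size(0) - 1]
print(src.shape)

# catch the bad index on the host before the kernel asserts on the device
assert 0 <= index.min() and index.max() < src.size(0), "index out of bounds"
out = src[index]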

I tried that and the new error is:

Traceback (most recent call last):
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1893, in <module>
train_FocusNet(del_log_dirs=args.del_log_dirs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/train/train_detector_optim.py", line 1772, in train_FocusNet
y_hat, y_hat_hwl, y_hat_roty, y_hat_attn_l2, y_hat_attn_l3 = net1(train_x) # NOTE: Get predictions
File "/usr/scratch4/samo4615/miniconda3/envs/envpy38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/lit_models/lit_model_focusnet.py", line 79, in forward
enc_feats = self.encoder(x)
File "/usr/scratch4/samo4615/miniconda3/envs/envpy38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/lit_models/lit_encoder_decoder.py", line 42, in forward
x1_high, x1_low = self.db1(x)
File "/usr/scratch4/samo4615/miniconda3/envs/envpy38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/scratch4/samo4615/Documents/codeworks/radarworks/argoverse_2/root/lit_models/lit_blocks.py", line 102, in forward
x_copy = self.c_skip(x_copy) # get it ready for adding
File "/usr/scratch4/samo4615/miniconda3/envs/envpy38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
return forward_call(*input, **kwargs)
File "/usr/scratch4/samo4615/miniconda3/envs/envpy38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 463, in forward
return self._conv_forward(input, self.weight, self.bias)
File "/usr/scratch4/samo4615/miniconda3/envs/envpy38/lib/python3.8/site-packages/torch/nn/modules/conv.py", line 459, in _conv_forward
return F.conv2d(input, weight, bias, self.stride,
RuntimeError: no valid convolution algorithms available in CuDNN


More surprisingly, the exact same script runs fine when I launch it on a single computer rather than through the LSF cluster. Both cases use A100 and A40 GPUs.

Is it possible that TensorBoard causes some conflict that leads to this error? Though that still would not explain why training works fine on the other PC.
I just saw a notification during PyTorch Lightning training that TensorBoard can conflict with other ML frameworks.

I don't know what's causing the issue in your environment, as you seem to be running into different random failures. I also don't know what LSF is, but it seems to be related to your issues. Did you fix the indexing issue, by the way?

I don't think there really is an indexing issue. As I explained before, training runs perfectly fine on another workstation with an A100; it just crashes on the LSF cluster.

But thank you for trying to help!

Best Regards
Sambit

I do think the indexing issue is real:

/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1673730874951/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [3946,0,0], thread: [28,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.

and should be fixed, as the other CUDA/cuDNN errors could just be re-raised from a sticky failure that corrupts the CUDA context, as seen here:

import torch
import torch.nn as nn

conv = nn.Conv2d(3, 3, 3).cuda()
x = torch.randn(1, 3, 224, 224).cuda()

# works
out = conv(x)
print(out.shape)
# torch.Size([1, 3, 222, 222])

# invalid index assert
a = torch.randn(10, 10).cuda()
a.gather(0, torch.tensor([[11]]).cuda())
# ../aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [0,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
# RuntimeError: CUDA error: device-side assert triggered

# re-raises the issue
out = conv(x)
# RuntimeError: cuDNN error: CUDNN_STATUS_MAPPING_ERROR

Okay, I will take another look at it. Even that seems hard to locate!! :smiley:

Hi, I also ran into this problem on a 3090. Did you solve it by uninstalling TensorBoard on LSF?