Device Side Assert on the backward() call

hadarshavit · February 20, 2023, 10:46am

I have the following error:

Traceback (most recent call last):
  File "/home/s3092593/as-graph/slurm/../train.py", line 204, in <module>
    train(config)
  File "/home/s3092593/as-graph/slurm/../train.py", line 100, in train
    train_loss, train_runtime, train_accuracy, train_dist = train(model, train_loader, optimizer, criterion, runtime_evaluator, accuracy_evaluator,
  File "/home/s3092593/as-graph/deeper_gnn/train.py", line 76, in train
    scaler.scale(loss).backward()
  File "/data1/s3092593/thesis/lib/python3.10/site-packages/torch/_tensor.py", line 488, in backward
    torch.autograd.backward(
  File "/data1/s3092593/thesis/lib/python3.10/site-packages/torch/autograd/__init__.py", line 197, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered

I ran the code with CUDA_LAUNCH_BLOCKING=1. The error happens after many batches, but the exact number changes from run to run.
When it happens, it also prints many assertion failures:

...
/opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [57,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [58,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [59,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [60,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [61,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [62,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/opt/conda/conda-bld/pytorch_1670525541990/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [0,0,0], thread: [63,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
...

However, I cannot find the source of the error.

hadarshavit · February 20, 2023, 3:02pm

The error happens only on the GPU, not on the CPU.

ptrblck · February 20, 2023, 9:09pm

A scatter or gather operation is failing with an invalid index. Check the stack trace to see which operation fails, then make sure the index tensor contains only valid indices in the expected range.
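The range check described above can be sketched as a small debugging helper. This is a minimal sketch, not a PyTorch API: `check_index` is a hypothetical name, and the toy tensors are made up for illustration. It assumes the index is validated against the size of the target tensor along the scatter/gather dimension, which is what the `idx_dim >= 0 && idx_dim < index_size` assertion tests.

```python
import torch


def check_index(index: torch.Tensor, target: torch.Tensor, dim: int) -> None:
    """Raise IndexError if `index` holds values outside [0, target.size(dim) - 1].

    Hypothetical debugging helper, not part of PyTorch. The int() conversions
    copy values to the host, which forces a GPU sync, so remove the call once
    the bug is found.
    """
    if index.numel() == 0:
        return
    size = target.size(dim)
    lo, hi = int(index.min()), int(index.max())
    if lo < 0 or hi >= size:
        raise IndexError(
            f"index values span [{lo}, {hi}], "
            f"valid range along dim {dim} is [0, {size - 1}]"
        )


# Toy tensors on the CPU; the same call works for CUDA tensors.
out = torch.zeros(1, 5, 1)
src = torch.ones(1, 3, 1)
good_idx = torch.tensor([[[0], [4], [4]]])
bad_idx = torch.tensor([[[0], [4], [5]]])  # 5 is one past the last valid position

check_index(good_idx, out, dim=1)  # passes silently
out.scatter_add_(1, good_idx, src)

try:
    check_index(bad_idx, out, dim=1)
except IndexError as err:
    print(err)
```

Calling such a helper right before every `gather`, `scatter_` or `scatter_add_` in the forward pass turns the asynchronous device-side assert into a Python exception that names the offending values.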

nuo_chen · December 6, 2023, 2:24pm

/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664415167092/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1663,0,0], thread: [88,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664415167092/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1663,0,0], thread: [89,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664415167092/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1663,0,0], thread: [90,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
/home/conda/feedstock_root/build_artifacts/pytorch-recipe_1664415167092/work/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:144: operator(): block: [1663,0,0], thread: [91,0,0] Assertion `idx_dim >= 0 && idx_dim < index_size && "index out of bounds"` failed.
Traceback (most recent call last):
  File "/media/yons/new/Code/AOBnet_v3/train_flow.py", line 101, in
    loss.backward()
  File "/home/yons/anaconda3/envs/DCEIFlow/lib/python3.8/site-packages/torch/_tensor.py", line 396, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, inputs=inputs)
  File "/home/yons/anaconda3/envs/DCEIFlow/lib/python3.8/site-packages/torch/autograd/__init__.py", line 173, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: CUDA error: device-side assert triggered
Epoch: 0: 26%|██ | 77.0/295 [00:14<00:42, 5.14Batch/s, Total_Loss=40e5, loss_cm=7.61e5, loss_smoth=39.3e5, loss_value=46.9e5]

I ran into the same problem! Here is the code:

import torch

def interpolate(idx, weights, res, polarity_mask=None):
    """
    Create an image-like representation of the warped events.
    :param idx: [batch_size x N x 1] warped event locations
    :param weights: [batch_size x N x 1] interpolation weights for the warped events
    :param res: resolution of the image space
    :param polarity_mask: [batch_size x N x 2] polarity mask for the warped events (default = None)
    :return image of warped events
    """
    if polarity_mask is not None:
        # polarity_mask = polarity_mask.to('cpu')
        weights = weights * polarity_mask
    iwe = torch.zeros((idx.shape[0], res[0] * res[1], 1)).to('cuda:0')
    idx = idx.long().clip(0, iwe.shape[1] - 1)  # make index tensor valid
    iwe = iwe.scatter_add_(1, idx, weights)  # iwe: (1, 720*1280, 1)
    iwe = iwe.view((idx.shape[0], 1, res[0], res[1]))
    return iwe

ptrblck · December 6, 2023, 2:26pm

Your code is not executable, so could you add the missing parts and definitions to reproduce the issue?

nuo_chen · December 6, 2023, 2:32pm

Sorry, the complete code is too large to upload. I have already limited the range of the index, but it still reports an index out-of-bounds error, which I cannot understand. Additionally, when I run the code on the CPU, no errors are reported.
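Since both tracebacks in this thread end inside `backward()`, they do not show which forward operation created the failing scatter/gather node, and the clamped `scatter_add_` may not be the one that asserts. One way to narrow this down is autograd anomaly detection, which records the forward traceback of each operation and reports it when that operation's backward fails. The sketch below is a CPU-only stand-in, not a reproduction of the error above: it assumes a NaN gradient (from `sqrt` at zero) as the failure, because an out-of-bounds index would already fail in the forward pass on the CPU.

```python
import torch

# Toy input chosen so the backward pass of sqrt produces 0 / 0 = NaN.
x = torch.zeros(1, requires_grad=True)

message = ""
try:
    # Both the forward and the backward pass must run inside the context:
    # the forward traceback is recorded when the operation is created.
    with torch.autograd.detect_anomaly():
        loss = (torch.sqrt(x) * 0.0).sum()
        loss.backward()
except RuntimeError as err:
    message = str(err)

# The message names the backward function that failed; the traceback of
# the matching forward call is printed to stderr as a warning.
print(message)
```

In a real training loop the same context manager would wrap the model's forward call and `loss.backward()` for a few debugging iterations. Anomaly detection slows training considerably, so it should be switched off again afterwards.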