Kubeflow Jupyter notebook - RuntimeError: CUDA error: device-side assert triggered

torchstudent · March 16, 2023, 7:10pm

Epoch: [0] [ 0/34] eta: 0:00:19 lr: 0.001000 loss: 9.5855 (9.5855) loss_classifier: 0.7124 (0.7124) loss_box_reg: 0.0743 (0.0743) loss_keypoint: 8.0775 (8.0775) loss_objectness: 0.6929 (0.6929) loss_rpn_box_reg: 0.0285 (0.0285) time: 0.5787 data: 0.0540 max mem: 3215
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [24,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [25,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [26,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [27,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [28,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [29,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [30,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:142: operator(): block: [0,0,0], thread: [31,0,0] Assertion index >= -sizes[i] && index < sizes[i] && "index out of bounds" failed.
Traceback (most recent call last):
  File "traine.py", line 155, in <module>
    train_one_epoch(model, optimizer, data_loader_train, device, epoch, print_freq=1000)
  File "/home/targetdir/keypoint_rcnn_training_pytorch-main/engine.py", line 31, in train_one_epoch
    loss_dict = model(images, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/detection/generalized_rcnn.py", line 97, in forward
    proposals, proposal_losses = self.rpn(images, features, targets)
  File "/opt/conda/lib/python3.8/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/detection/rpn.py", line 364, in forward
    loss_objectness, loss_rpn_box_reg = self.compute_loss(
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/detection/rpn.py", line 296, in compute_loss
    sampled_pos_inds, sampled_neg_inds = self.fg_bg_sampler(labels)
  File "/opt/conda/lib/python3.8/site-packages/torchvision/models/detection/_utils.py", line 60, in __call__
    neg_idx_per_image = negative[perm2]
RuntimeError: CUDA error: device-side assert triggered

Hello everyone,

I am getting the above error for single-class object detection and segmentation, that is, where there is only one class of objects to detect and train on. The error occurs in Kubeflow notebooks when I train. I already tried os.environ['CUDA_LAUNCH_BLOCKING'] = "1" and !export CUDA_LAUNCH_BLOCKING=1.

Can anyone help me with this please?

ptrblck · March 16, 2023, 9:40pm

Note that these environment variables are used to isolate the issue further by blocking each CUDA kernel execution, which should then point to the line of code raising the error. It’s not a mechanism to solve the issue.
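As a minimal sketch of how the variable is usually set inside a notebook: it has to be in the environment of the Python process before the first CUDA call, so the simplest approach is to set it in the very first cell, before importing torch. Note that `!export CUDA_LAUNCH_BLOCKING=1` runs in a separate subshell and does not change the environment of the notebook kernel itself.

```python
import os

# Set the variable in the kernel's own process, before anything touches CUDA.
# The simplest way to guarantee this is to run it before `import torch`.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable has been set

# Confirm the kernel process actually sees the value.
print(os.environ["CUDA_LAUNCH_BLOCKING"])
```

If the kernel has already run CUDA code, restart it first and run this cell before anything else. The `%env CUDA_LAUNCH_BLOCKING=1` notebook magic should have the same effect as the `os.environ` assignment.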

Based on the error message, an indexing operation uses invalid indices and fails.
Could you check the shape of negative, as well as the shape and the min/max values of perm2, in:

neg_idx_per_image = negative[perm2]
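
One way to collect this information is sketched below. `describe_index` is a hypothetical helper written for this post, not part of torch or torchvision, and the tensors are made-up stand-ins for negative and perm2. The check has to run before the failing indexing line (for example in a debugger or a temporary print), because the CUDA context is no longer usable once the assert has been triggered.

```python
import torch


def describe_index(source: torch.Tensor, index: torch.Tensor) -> dict:
    """Summarize the operation source[index] along dim 0 (hypothetical helper)."""
    index = index.detach().cpu()  # copy to the CPU so reading values is safe
    size = source.shape[0]
    info = {
        "source_shape": tuple(source.shape),
        "index_shape": tuple(index.shape),
        "index_min": None,
        "index_max": None,
        "in_bounds": True,  # an empty index cannot be out of bounds
    }
    if index.numel() > 0:
        info["index_min"] = int(index.min())
        info["index_max"] = int(index.max())
        # Same condition as the failing assert: -size <= index < size
        info["in_bounds"] = -size <= info["index_min"] and info["index_max"] < size
    return info


# Made-up example values; in the real run these would be the tensors
# named negative and perm2 inside torchvision's sampler.
negative = torch.arange(10)
perm2 = torch.tensor([0, 3, 12])  # 12 is out of bounds for 10 elements

report = describe_index(negative, perm2)
print(report)
```

Running the same training step on the CPU is another option, since the CPU indexing kernel raises a regular Python IndexError that names the offending index.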