How to debug RuntimeError: CUDA error: device-side assert triggered?

2hyes · March 8, 2021, 9:16am

Hi. I’m trying YOLOv3 transfer learning with the NWPU dataset.
I want to detect only people, so my nwpu.names file looks like this:

person

During training, I got a runtime error (RuntimeError: CUDA error: device-side assert triggered).
The full error output is below.

Training Epoch 0:  76%|███████▌  | 87/115 [04:48<00:36,  1.31s/it]/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Training Epoch 0:  76%|███████▌  | 87/115 [05:07<01:38,  3.53s/it]
Traceback (most recent call last):
  File "trainer.py", line 235, in <module>
    trainer(opt.data_config, opt.multiscale_training, opt.img_size, opt.batch_size, opt.n_cpu, opt.model_def, opt.pretrained_weights,
  File "trainer.py", line 191, in trainer
    train_result = train(model, optimizer, train_dataloader, epoch, device,  gradient_accumulations)
  File "trainer.py", line 96, in train
    loss, outputs = model(imgs, targets)
  File "~~~/.pyenv/versions/yoloenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "~~~/models.py", line 266, in forward
    x, layer_loss = module[0](x, targets, img_dim)
  File "~~~/.pyenv/versions/yoloenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "~~~/vv-yolo/yolov3/models.py", line 191, in forward
    iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
  File "~~~/vv-yolo/yolov3/utils/utils.py", line 306, in build_targets
    noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0
RuntimeError: CUDA error: device-side assert triggered

Someone said it is an input/output dimension error, but I don’t think so.
I also adjusted the filters to match the number of classes (one class → 18 filters in the layer right before each yolo layer).
Someone else said they hit this error with a custom loss and solved it by adjusting the batch size, but that doesn’t apply to me.
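
For reference, the filter count of 18 can be checked with a small calculation. This is a minimal sketch that assumes the standard YOLOv3 layout of 3 anchors per yolo layer, where each anchor predicts 4 box offsets, 1 objectness score, and one score per class. The function name `yolo_head_filters` is hypothetical and not part of the repository in question.

```python
def yolo_head_filters(num_classes: int, anchors_per_scale: int = 3) -> int:
    """Output channels of the conv layer that feeds a yolo layer.

    Assumes each anchor predicts 4 box offsets, 1 objectness score,
    and one score per class (the usual YOLOv3 convention).
    """
    return anchors_per_scale * (num_classes + 5)


single_class = yolo_head_filters(1)   # person only
coco = yolo_head_filters(80)          # full COCO label set
print(single_class, coco)
```

Under these assumptions a single-class model needs 18 filters, which matches the setting described above, so the filter count alone is unlikely to explain the assertion.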

Do you have any suggestions? Thank you.

ptrblck · March 8, 2021, 9:22am

The error points towards a failed indexing operation:

Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.

Based on the stacktrace I assume it’s caused in this line of code:

noobj_mask[b[i], anchor_ious > ignore_thres, gj[i], gi[i]] = 0

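CUDA kernels are launched asynchronously, so the line reported in the stack trace is not necessarily the one that failed. To make sure it’s the right line of code causing this assertion, rerun the code with CUDA_LAUNCH_BLOCKING=1 python script.py args, which makes the stack trace point to the operation that actually failed.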
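
If setting the variable on the command line is inconvenient (for example in a notebook or an IDE run configuration), it can also be set from Python. This is a sketch, not the only way to do it, and it assumes the assignment runs at the very top of the entry script, before torch initializes CUDA; setting it later may have no effect.

```python
import os

# Assumption: this is the first thing the entry script does, before torch
# touches the GPU. With this set, kernel launches run synchronously, so the
# Python stack trace points at the operation that actually failed.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# import torch  # import torch only after the variable is set
# ... rest of the training script ...

launch_blocking = os.environ.get("CUDA_LAUNCH_BLOCKING")
print("CUDA_LAUNCH_BLOCKING =", launch_blocking)
```

Synchronous launches slow training down noticeably, so this is best used only while debugging.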

2hyes · March 8, 2021, 9:33am

@ptrblck

Training Epoch 0:  71%|████████████████████████████████████████▋                | 82/115 [05:18<01:00,  1.83s/it]/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [0,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [2,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [3,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [4,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [26,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/pytorch/aten/src/ATen/native/cuda/IndexKernel.cu:84: operator(): block: [0,0,0], thread: [28,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
Training Epoch 0:  71%|████████████████████████████████████████▋                | 82/115 [05:33<02:14,  4.07s/it]
Traceback (most recent call last):
  File "trainer.py", line 235, in <module>
    trainer(opt.data_config, opt.multiscale_training, opt.img_size, opt.batch_size, opt.n_cpu, opt.model_def, opt.pretrained_weights,
  File "trainer.py", line 191, in trainer
    train_result = train(model, optimizer, train_dataloader, epoch, device,  gradient_accumulations)
  File "trainer.py", line 96, in train
    loss, outputs = model(imgs, targets)
  File "~~~/.pyenv/versions/yoloenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "~~~~~/models.py", line 266, in forward
    x, layer_loss = module[0](x, targets, img_dim)
  File "~~~/.pyenv/versions/yoloenv/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
    result = self.forward(*input, **kwargs)
  File "~~~~/models.py", line 191, in forward
    iou_scores, class_mask, obj_mask, noobj_mask, tx, ty, tw, th, tcls, tconf = build_targets(
  File "~~~~~/utils/utils.py", line 301, in build_targets
    obj_mask[b, best_n, gj, gi] = 1
RuntimeError: CUDA error: device-side assert triggered

This is the error I get with CUDA_LAUNCH_BLOCKING=1 python script.py args. The traceback now points to line 301 of utils.py, so I think this is the line that really triggers the assertion, but I don’t understand what the error means.

Surprisingly, the same code runs fine when I use the COCO dataset with only one class (person).

ptrblck · March 8, 2021, 10:05am

Yes, this line causes it:

obj_mask[b, best_n, gj, gi] = 1

Print all values of b, best_n, gj, and gi, as well as the shape of obj_mask, and make sure no index is out of bounds for its dimension.
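
A minimal sketch of such a check is shown below. It mirrors the condition in the assertion message (`index >= -sizes[i] && index < sizes[i]`) in plain Python. The helper name `find_out_of_bounds` and the example values are hypothetical; with real tensors one would pass something like `b.tolist()` for each index and `tuple(obj_mask.shape)` for the sizes.

```python
from typing import Dict, List, Sequence, Tuple


def find_out_of_bounds(
    index_values: Dict[str, Sequence[int]],
    dim_sizes: Sequence[int],
) -> List[Tuple[str, int, int, int]]:
    """Return (name, position, value, size) for every index outside [-size, size).

    The i-th entry of index_values is checked against the i-th dimension size,
    so the dict must be given in the same order as the indexing expression.
    """
    problems = []
    for (name, values), size in zip(index_values.items(), dim_sizes):
        for pos, value in enumerate(values):
            if not (-size <= value < size):
                problems.append((name, pos, value, size))
    return problems


# Made-up example: a mask of shape (batch=8, anchors=3, grid=13, grid=13).
mask_shape = (8, 3, 13, 13)
indices = {
    "b": [0, 1],
    "best_n": [2, 0],
    "gj": [5, 13],   # 13 is one past the last valid row of a 13x13 grid
    "gi": [12, 4],
}
problems = find_out_of_bounds(indices, mask_shape)
for name, pos, value, size in problems:
    print(f"{name}[{pos}] = {value} is out of bounds for a dimension of size {size}")
```

Running a check like this on the CPU copy of the indices just before the failing line should show which of the four index tensors is responsible, and for which target.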

rinchenson · July 7, 2021, 9:13pm

Hi, did you manage to fix your problem? I ran into a similar problem: my code works well for 1 class, but it crashes with multiple classes. Please let me know what the root cause was. Thank you so much.