Distributed training with GPUs

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [40,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 954, in <module>
    main()
  File "main.py", line 665, in main
    optimizers=optimizers)
  File "main.py", line 773, in train_one_epoch
    label_size=args.token_label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 90, in mixup_target
    y1 = get_labelmaps_with_coords(target, num_classes, on_value=on_value, off_value=off_value, device=device, label_size=label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 64, in get_labelmaps_with_coords
    num_classes=num_classes,device=device)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 12, in get_featuremaps
    label_maps_topk_sizes[3]], 0, dtype=torch.float32 ,device=device)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: /pytorch/torch/lib/c10d/…/c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8

The stacktrace points to an invalid indexing operation.
You could try to rerun your script with CUDA_LAUNCH_BLOCKING=1, which should print the operation that uses the invalid index directly in the stacktrace. If that doesn't help, try this env variable on a single GPU, or run the script on the CPU, to isolate the issue further.
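For reference, a minimal sketch of setting the variable from inside the script (the file name main.py is just taken from the traceback above). It has to be set before the first CUDA call, so it should go at the very top of the entry script; prefixing the launch command with CUDA_LAUNCH_BLOCKING=1 works as well.

# Minimal sketch: make CUDA kernel launches synchronous so the failing
# operation shows up directly in the Python stacktrace.
import os

# must be set before any CUDA context is created, i.e. before torch.cuda is used
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after the env variable is set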

Thanks for your reply. I did what you suggested, but I got the same error report, as shown below:
Train: 0 [ 0/40036 ( 0%)] Loss: 10.430088 (10.4301) Time: 4.107s, 7.79/s (4.107s, 7.79/s) LR: 1.000e-06 Data: 2.876 (2.876)
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [25,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 949, in <module>
    main()
  File "main.py", line 664, in main
    optimizers=optimizers)
  File "main.py", line 773, in train_one_epoch
    label_size=args.token_label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 90, in mixup_target
    y1 = get_labelmaps_with_coords(target, num_classes, on_value=on_value, off_value=off_value, device=device, label_size=label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 70, in get_labelmaps_with_coords
    device=device)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 31, in get_label
    1).float().to(device),
RuntimeError: CUDA error: device-side assert triggered

What's more, I am running the VOLO code (https://github.com/sail-sg/volo) with apex amp, PyTorch 1.7.1, CUDA 10.1, and torchvision 0.8.2 on a single GPU.

Based on the stacktrace, the blocking launch env variable doesn't seem to be set or didn't take effect, so you could either run the code on the CPU or manually synchronize the script (via torch.cuda.synchronize()) to further isolate the invalid index. Alternatively, adding assert statements that check the inputs to the indexing operations might also work.
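As an illustration of that last point, here is a hypothetical helper (checked_one_hot is not part of the tlt/volo code, and the scatter_-based one-hot encoding is only an assumption about what mixup.py does internally) that validates the label indices before the scatter_ call that typically triggers this ScatterGatherKernel assert:

import torch

def checked_one_hot(target, num_classes, device):
    # Hypothetical helper: check the class indices before they are used as
    # scatter indices on the GPU, so an out-of-range label raises a readable
    # Python error instead of a device-side assert.
    target = target.long().view(-1, 1)
    assert target.min() >= 0 and target.max() < num_classes, (
        f"labels must lie in [0, {num_classes}), got "
        f"min={target.min().item()}, max={target.max().item()}"
    )
    one_hot = torch.zeros(target.size(0), num_classes, device=device)
    return one_hot.scatter_(1, target.to(device), 1.0)

Calling something like checked_one_hot(target, num_classes, device) right before the mixup/token-label code would tell you immediately whether the dataset produces a label that is negative or >= num_classes, which is the usual cause of this assertion.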