Distributed training with GPUs

/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [1,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [40,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [6,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 954, in <module>
    main()
  File "main.py", line 665, in main
    optimizers=optimizers)
  File "main.py", line 773, in train_one_epoch
    label_size=args.token_label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 90, in mixup_target
    y1 = get_labelmaps_with_coords(target, num_classes, on_value=on_value, off_value=off_value, device=device, label_size=label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 64, in get_labelmaps_with_coords
    num_classes=num_classes,device=device)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 12, in get_featuremaps
    label_maps_topk_sizes[3]], 0, dtype=torch.float32 ,device=device)
RuntimeError: CUDA error: device-side assert triggered
terminate called after throwing an instance of 'std::runtime_error'
  what(): NCCL error in: /pytorch/torch/lib/c10d/…/c10d/NCCLUtils.hpp:136, unhandled cuda error, NCCL version 2.7.8

The stacktrace points to an invalid indexing operation.
You could try to rerun your script with CUDA_LAUNCH_BLOCKING=1, which should print the operation that uses the invalid index directly in the stacktrace. If that doesn't help, try this env variable on a single GPU, or run the script on the CPU, to isolate the issue further.
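For reference, a minimal sketch of setting the variable from inside the script (the file name main.py is just taken from the traceback above). It has to be set before the first CUDA call, so it should go at the very top of the entry script; prefixing the launch command with CUDA_LAUNCH_BLOCKING=1 works as well.

# Minimal sketch: make CUDA kernel launches synchronous so the failing
# operation shows up directly in the Python stacktrace.
import os

# must be set before any CUDA context is created, i.e. before torch.cuda is used
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import torch only after the env variable is set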

Thanks for your reply. I did what you suggested, but I got the same error report, as shown below:
Train: 0 [ 0/40036 ( 0%)] Loss: 10.430088 (10.4301) Time: 4.107s, 7.79/s (4.107s, 7.79/s) LR: 1.000e-06 Data: 2.876 (2.876)
/pytorch/aten/src/ATen/native/cuda/ScatterGatherKernel.cu:312: operator(): block: [0,0,0], thread: [25,0,0] Assertion idx_dim >= 0 && idx_dim < index_size && "index out of bounds" failed.
Traceback (most recent call last):
  File "main.py", line 949, in <module>
    main()
  File "main.py", line 664, in main
    optimizers=optimizers)
  File "main.py", line 773, in train_one_epoch
    label_size=args.token_label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 90, in mixup_target
    y1 = get_labelmaps_with_coords(target, num_classes, on_value=on_value, off_value=off_value, device=device, label_size=label_size)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 70, in get_labelmaps_with_coords
    device=device)
  File "/opt/conda/lib/python3.6/site-packages/tlt/data/mixup.py", line 31, in get_label
    1).float().to(device),
RuntimeError: CUDA error: device-side assert triggered

What's more, I am running the VOLO code (https://github.com/sail-sg/volo) with apex amp, PyTorch 1.7.1, CUDA 10.1, and torchvision 0.8.2 on a single GPU.

Based on the stacktrace, the blocking launch env variable doesn't seem to be set or didn't take effect, so you could either run the code on the CPU or manually synchronize the script (via torch.cuda.synchronize()) to further isolate the invalid index. Alternatively, adding assert statements that check the inputs to the indexing operations might also work.
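As an illustration of that last point, here is a hypothetical helper (checked_one_hot is not part of the tlt/volo code, and the scatter_-based one-hot encoding is only an assumption about what mixup.py does internally) that validates the label indices before the scatter_ call that typically triggers this ScatterGatherKernel assert:

import torch

def checked_one_hot(target, num_classes, device):
    # Hypothetical helper: check the class indices before they are used as
    # scatter indices on the GPU, so an out-of-range label raises a readable
    # Python error instead of a device-side assert.
    target = target.long().view(-1, 1)
    assert target.min() >= 0 and target.max() < num_classes, (
        f"labels must lie in [0, {num_classes}), got "
        f"min={target.min().item()}, max={target.max().item()}"
    )
    one_hot = torch.zeros(target.size(0), num_classes, device=device)
    return one_hot.scatter_(1, target.to(device), 1.0)

Calling something like checked_one_hot(target, num_classes, device) right before the mixup/token-label code would tell you immediately whether the dataset produces a label that is negative or >= num_classes, which is the usual cause of this assertion.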