Background:
Training runs fine on a single GPU, but with 2 or more GPUs it crashes, and the crash always happens at the end of an epoch.
The master process (rank 0) always fails at default_pg.allreduce, while the failure point reported by the worker processes is not consistent: sometimes it is at norm_layer in swin_transformer, sometimes at fc1, and sometimes at attn_mask.masked_fill.
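To illustrate the failure pattern (this is not the project's code, just a minimal two-GPU sketch): if one rank dies mid-epoch, for example on a device-side assert, the surviving rank's collective never completes and the NCCL watchdog aborts it after the configured timeout, which is exactly what the master log below shows.

# desync_demo.py -- minimal sketch, assuming a 2-GPU single-node setup.
# Run with e.g.: python -m torch.distributed.launch --use_env --nproc_per_node=2 desync_demo.py
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")           # reads RANK / WORLD_SIZE from the env
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    torch.cuda.set_device(local_rank)
    t = torch.ones(1, device="cuda")
    if dist.get_rank() == 1:
        # Stand-in for a worker rank crashing before the collective.
        raise RuntimeError("rank 1 failed before the collective")
    dist.all_reduce(t)                                 # rank 0's collective never finishes
    print("all_reduce done:", t.item())                # sync point; the watchdog timeout surfaces here

if __name__ == "__main__":
    main()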
Here is the training command:
python3.7 -m torch.distributed.launch \
--nproc_per_node=2 \
--nnodes=1 \
main.py \
--data_root /mnt/common/Next-Generation-OCR/dataset/src/ \
--anno_root /mnt/common/Next-Generation-OCR/dataset/src/ \
--output_folder /mnt/common/wangpeng/research/exp/sptsOCR/debug/ \
--train_dataset ctw1500 \
--lr 0.0005 \
--max_steps 1000000 \
--warmup_steps 5000 \
--checkpoint_freq 20 \
--checkpoint_freq_step 20000 \
--batch_size 1 \
--tfm_pre_norm \
--train_max_size 768 \
--rec_loss_weight 2 \
--num_workers 0 \
--image \
--seed 253 \
--max_prompt 8 \
--granularity word
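Not part of the project, but a sketch of generic CUDA/NCCL debugging settings that can go at the very top of main.py (or be exported in the shell before the command above) to make the device-side assert surface at its real call site; the timeout value is a placeholder:

import os
from datetime import timedelta
import torch.distributed as dist

# These must take effect before any CUDA work / before the process group is created.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"        # report the assert at the kernel that triggers it
os.environ["NCCL_DEBUG"] = "INFO"               # verbose NCCL logging
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"   # tear down ranks stuck on a failed collective

# Optional while debugging: a longer collective timeout (placeholder value; for NCCL
# it is honored when async error handling or blocking wait is enabled).
# dist.init_process_group(backend="nccl", timeout=timedelta(minutes=60))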
Here is the master error:
Epoch: [13] Total time: 0:05:03 (1.2123 s / it)
[E ProcessGroupNCCL.cpp:587] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=450000) ran for 459084 milliseconds before timing out.
Traceback (most recent call last):
File "main.py", line 100, in <module>
main(args)
File "main.py", line 73, in main
train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 101, in train_one_epoch
metric_logger.synchronize_between_processes()
File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/utils/logger.py", line 100, in synchronize_between_processes
meter.synchronize_between_processes()
File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/utils/logger.py", line 34, in synchronize_between_processes
dist.all_reduce(t)
File "/home/pai/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 1285, in all_reduce
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 0. Original reason for failure was: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=450000) ran for 459084 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:341] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=450000) ran for 459084 milliseconds before timing out.
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 0 (pid: 299) of binary: /home/pai/bin/python3
Traceback (most recent call last):
File "/home/pai/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/home/pai/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 193, in <module>
main()
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 189, in main
launch(args)
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launch.py", line 174, in launch
run(args)
File "/home/pai/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
)(*cmd_args)
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/home/pai/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
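For context, metric_logger.synchronize_between_processes() in the traceback presumably looks like the common DETR-style helper below (a hedged sketch, not the project's logger.py): every rank must reach the all_reduce, so once a worker has crashed, rank 0 sits in it until the watchdog aborts the communicator.

import torch
import torch.distributed as dist

class SmoothedValue:
    # Sketch of a DETR-style meter; only the synchronization path is shown.
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def synchronize_between_processes(self):
        if not (dist.is_available() and dist.is_initialized()):
            return
        t = torch.tensor([self.count, self.total], dtype=torch.float64, device="cuda")
        dist.barrier()
        dist.all_reduce(t)      # hangs on rank 0 if another rank has already died
        t = t.tolist()
        self.count = int(t[0])
        self.total = t[1]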
Here is the worker error:
/workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [2,0,0], thread: [29,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [2,0,0], thread: [30,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
/workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/aten/src/ATen/native/cuda/IndexKernel.cu:93: operator(): block: [2,0,0], thread: [31,0,0] Assertion `index >= -sizes[i] && index < sizes[i] && "index out of bounds"` failed.
damo-0138:295:427 [0] init.cc:988 NCCL WARN Cuda failure 'device-side assert triggered'
Traceback (most recent call last):
File "main.py", line 100, in <module>
main(args)
File "main.py", line 73, in main
train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 38, in train_one_epoch
outputs = model(samples, input_seqs_, mask_prompts)
File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pai/lib/python3.6/site-packages/torch/nn/parallel/distributed.py", line 886, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/spts.py", line 16, in forward
features, pos = self.backbone(samples)
File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/joiner.py", line 11, in forward
xs = self[0](tensor_list)
File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/mnt/user/E-yuekun.wp-313300/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 618, in forward
x_out = norm_layer(x_out)
File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
return forward_call(*input, **kwargs)
File "/home/pai/lib/python3.6/site-packages/torch/nn/modules/normalization.py", line 190, in forward
input, self.normalized_shape, self.weight, self.bias, self.eps)
File "/home/pai/lib/python3.6/site-packages/torch/nn/functional.py", line 2347, in layer_norm
return torch.layer_norm(input, normalized_shape, weight, bias, eps, torch.backends.cudnn.enabled)
RuntimeError: CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
terminate called after throwing an instance of 'c10::CUDAError'
what(): CUDA error: device-side assert triggered
CUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Exception raised from create_event_internal at /workspace/artifacts/paipytorch1.10/dist/ubuntu18.04-py3.6-cuda11.3/build/src/c10/cuda/CUDACachingAllocator.cpp:1211 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x60 (0x7f6465886750 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x1df9e (0x7f64ab98bf9e in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #2: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0x1bb (0x7f64ab98d26b in /home/pai/lib/python3.6/site-packages/torch/lib/libc10_cuda.so)
frame #3: c10::TensorImpl::release_resources() + 0xa4 (0x7f646586d8a4 in /home/pai/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #4: <unknown function> + 0x296a19 (0x7f64b8f5fa19 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0xb1e385 (0x7f64b97e7385 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #6: THPVariable_subclass_dealloc(_object*) + 0x2c9 (0x7f64b97e76d9 in /home/pai/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #7: <unknown function> + 0xf2ec7 (0x560870d7eec7 in /home/pai/bin/python3)
frame #8: <unknown function> + 0xf2ec7 (0x560870d7eec7 in /home/pai/bin/python3)
frame #9: <unknown function> + 0xf2787 (0x560870d7e787 in /home/pai/bin/python3)
frame #10: <unknown function> + 0xf2617 (0x560870d7e617 in /home/pai/bin/python3)
frame #11: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #12: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #13: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #14: <unknown function> + 0xf262d (0x560870d7e62d in /home/pai/bin/python3)
frame #15: PyDict_SetItem + 0x3da (0x560870dc54ba in /home/pai/bin/python3)
frame #16: PyDict_SetItemString + 0x4f (0x560870dcc4df in /home/pai/bin/python3)
frame #17: PyImport_Cleanup + 0x99 (0x560870e31d49 in /home/pai/bin/python3)
frame #18: Py_FinalizeEx + 0x61 (0x560870e9c061 in /home/pai/bin/python3)
frame #19: Py_Main + 0x35e (0x560870ea63ae in /home/pai/bin/python3)
frame #20: main + 0xee (0x560870d7043e in /home/pai/bin/python3)
frame #21: __libc_start_main + 0xe7 (0x7f64edca8bf7 in /lib/x86_64-linux-gnu/libc.so.6)
frame #22: <unknown function> + 0x1c3d0b (0x560870e4fd0b in /home/pai/bin/python3)
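The IndexKernel assertions at the top of the worker log mean some index tensor is out of range for the tensor it indexes; everything after that (norm_layer, fc1, masked_fill, the cuBLAS error further down) is just where the asynchronous failure happens to be reported. A hedged sanity check that could be dropped in right before the forward call in train_one_epoch to catch the offending batch (the tensor and vocab-size names in the usage line are assumptions about this project):

import torch

def check_indices(name, idx, upper):
    # Copy to CPU so the check is synchronous and cannot itself trip the CUDA assert.
    idx_cpu = idx.detach().to("cpu")
    bad = (idx_cpu < 0) | (idx_cpu >= upper)
    if bad.any():
        raise ValueError(
            f"{name}: {int(bad.sum())} indices outside [0, {upper}); "
            f"min={int(idx_cpu.min())}, max={int(idx_cpu.max())}"
        )

# Hypothetical usage before `outputs = model(samples, input_seqs_, mask_prompts)`:
# check_indices("input_seqs_", input_seqs_, model.module.vocab_size)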
Sometimes the worker error looks like this instead:
dsw64682-57b6f9d7bc-lr6zq:16471:16480 [1] init.cc:924 NCCL WARN Cuda failure 'device-side assert triggered'
Traceback (most recent call last):
File "main.py", line 100, in <module>
main(args)
File "main.py", line 73, in main
train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 38, in train_one_epoch
outputs = model(samples, input_seqs_, mask_prompts)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted.
result = self.forward(*input, **kwargs)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
[E ProcessGroupNCCL.cpp:294] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/spts.py", line 16, in forward
features, pos = self.backbone(samples)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/joiner.py", line 11, in forward
xs = self[0](tensor_list)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 614, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 387, in forward
attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0))
or:
dsw64682-57b6f9d7bc-lr6zq:26670:26678 [1] init.cc:924 NCCL WARN Cuda failure 'device-side assert triggered'
main(args)
File "main.py", line 73, in main
train_stats, global_step = train_one_epoch(model, train_dataloader, criterion, optimizer, lr_scheduler, epoch, global_step, checkpointer, checkpoint_folder, args)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/engine/train.py", line 38, in train_one_epoch
outputs = model(samples, input_seqs_, mask_prompts)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
work = default_pg.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted.
result = self.forward(*input, **kwargs)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/parallel/distributed.py", line 705, in forward
output = self.module(*inputs[0], **kwargs[0])
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/spts.py", line 16, in forward
features, pos = self.backbone(samples)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/joiner.py", line 11, in forward
xs = self[0](tensor_list)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 614, in forward
x_out, H, W, x, Wh, Ww = layer(x, Wh, Ww)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 394, in forward
x = blk(x, attn_mask)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 251, in forward
x = x + self.drop_path(self.mlp(self.norm2(x)))
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/workspace/yuekun_ex/gitrepo/research-yuekun/promptable-ocr/tmp/sptsOCR/model/backbone/swin_transformer.py", line 31, in forward
x = self.fc1(x)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 94, in forward
return F.linear(input, self.weight, self.bias)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/nn/functional.py", line 1753, in linear
return torch._C._nn.linear(input, weight, bias)
RuntimeError: CUDA error: CUBLAS_STATUS_INTERNAL_ERROR when calling `cublasSgemm( handle, opa, opb, m, n, k, &alpha, a, lda, b, ldb, &beta, c, ldc)`
Killing subprocess 26669
Killing subprocess 26670
Traceback (most recent call last):
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/distributed/launch.py", line 340, in <module>
main()
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/distributed/launch.py", line 326, in main
sigkill_handler(signal.SIGTERM, None) # not coming back
File "/mnt/common/wangpeng/conda/envs/uniocr/lib/python3.7/site-packages/torch/distributed/launch.py", line 301, in sigkill_handler
raise subprocess.CalledProcessError(returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/mnt/common/wangpeng/conda/envs/uniocr/bin/python3.7', '-u', 'main.py', '--local_rank=1', '--data_root', '/mnt/common/Next-Generation-OCR/dataset/src/', '--anno_root', '/mnt/common/Next-Generation-OCR/dataset/src/', '--output_folder', '/mnt/common/wangpeng/research/exp/sptsOCR/debug/', '--train_dataset', 'ic15', '--lr', '0.0005', '--max_steps', '1000000', '--warmup_steps', '5000', '--checkpoint_freq', '20', '--checkpoint_freq_step', '20000', '--batch_size', '1', '--tfm_pre_norm', '--train_max_size', '768', '--rec_loss_weight', '2', '--num_workers', '0', '--image', '--seed', '253', '--max_prompt', '8', '--granularity', 'word']' returned non-zero exit status 1.