Thanks for your reply. I upgraded PyTorch to 1.13.0 and CUDA to 11.7, but I still have the same problem:
[E ProcessGroupNCCL.cpp:821] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808622 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808635 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808721 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:821] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000599, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808928 milliseconds before timing out.
Traceback (most recent call last):
  File "tools/train.py", line 220, in <module>
    main()
  File "tools/train.py", line 193, in main
    logger=logger
  File "/root/DSGN2/tools/train_utils/train_utils.py", line 141, in train_model
    logger=logger
  File "/root/DSGN2/tools/train_utils/train_utils.py", line 54, in train_one_epoch
    loss.backward()
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/_tensor.py", line 488, in backward
    self, gradient, retain_graph, create_graph, inputs=inputs
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/autograd/__init__.py", line 199, in backward
    allow_unreachable=True, accumulate_grad=True)  # Calls into the C++ engine to run the backward pass
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/autograd/function.py", line 267, in apply
    return user_fn(self, *args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/nn/modules/_functions.py", line 131, in backward
    combined, torch.distributed.ReduceOp.SUM, process_group, async_op=False)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1536, in all_reduce
    work = group.allreduce([tensor], opts)
RuntimeError: NCCL communicator was aborted on rank 3. Original reason for failure was: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808622 milliseconds before timing out.
[E ProcessGroupNCCL.cpp:456] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[E ProcessGroupNCCL.cpp:461] To avoid data inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
  what():  [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3000600, OpType=ALLREDUCE, Timeout(ms)=1800000) ran for 1808622 milliseconds before timing out.
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401125 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401126 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 401127 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -6) local_rank: 3 (pid: 401128) of binary: /root/miniconda3/envs/dsgn/bin/python
Traceback (most recent call last):
  File "/root/miniconda3/envs/dsgn/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/run.py", line 756, in run
    )(*cmd_args)
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/root/miniconda3/envs/dsgn/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 248, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
=======================================================
tools/train.py FAILED
-------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
-------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2023-04-17_21:32:01
  host      : autodl-container-e12911b43c-3e093856
  rank      : 3 (local_rank: 3)
  exitcode  : -6 (pid: 401128)
  error_file: <N/A>
  traceback : Signal 6 (SIGABRT) received by PID 401128
=======================================================
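
For context, Timeout(ms)=1800000 is the default 30-minute NCCL watchdog, so one rank apparently stops reaching the all_reduce long before the others give up. Below is a minimal sketch of the knobs I plan to try next to get more information; note the `init_process_group` call in DSGN2 is wrapped inside the training helpers, so the exact call site here is an assumption:

```python
import datetime
import os

import torch.distributed as dist

# These must be set before the process group / NCCL are initialized.
os.environ["NCCL_DEBUG"] = "INFO"                 # per-rank NCCL logs
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"  # extra c10d consistency checks
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"     # abort quickly instead of hanging

# Raise the collective timeout above the 30-minute default seen in the log.
# (Sketch only: in DSGN2 this call actually happens inside the training helpers.)
dist.init_process_group(
    backend="nccl",
    timeout=datetime.timedelta(hours=2),
)
```

That said, a longer timeout mostly hides the symptom: if one rank hangs in its data loader or takes a code path that skips the collective, the other three will still block in all_reduce until the watchdog fires, so the NCCL_DEBUG/TORCH_DISTRIBUTED_DEBUG output is probably the more useful part.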