My server has 4 A4000 GPUs. I am training a model with DDP (DistributedDataParallel), but partway through every run the training fails with the error below. How can I solve it?
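For context, the distributed setup in `scripts/train_radiant_pgd.py` looks roughly like the sketch below. Only the `dist.init_process_group(backend='nccl')` call is taken from the traceback; the launch command, argument names, and placeholder model are assumptions for illustration.

```python
# Minimal sketch of the DDP setup, assuming a torch.distributed.launch workflow
# (only the init_process_group call actually appears in the traceback below;
# everything else here is illustrative).
#
# Launched roughly as:
#   python -m torch.distributed.launch --nproc_per_node=4 scripts/train_radiant_pgd.py
import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main(args):
    # One process per GPU, NCCL backend (as in the traceback).
    dist.init_process_group(backend='nccl')
    torch.cuda.set_device(args.local_rank)

    # Placeholder model; the real script builds the radar/camera network here.
    model = torch.nn.Linear(128, 128).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])
    # ... training loop ...


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every worker process.
    parser.add_argument('--local_rank', type=int, default=0)
    main(parser.parse_args())
```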
Train Epoch: 4 [0/141886 (0%)] Loss: 0.373296
grad_norm: 6.0441 iteration: 53208
loss_radarClass:0.170 loss_radarOffset:0.133 loss_radarDepthOffset:0.071 loss:0.373
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 212680) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0
group_world_size=1
local_ranks=[0, 1, 2, 3]
role_ranks=[0, 1, 2, 3]
global_ranks=[0, 1, 2, 3]
role_world_sizes=[4, 4, 4, 4]
global_world_sizes=[4, 4, 4, 4]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/3/error.json
^CTraceback (most recent call last):
  File "scripts/train_radiant_pgd.py", line 564, in <module>
    main(args)
  File "scripts/train_radiant_pgd.py", line 351, in main
    dist.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 207, in _store_based_barrier
    time.sleep(0.01)
KeyboardInterrupt
(the same traceback is printed three times, once per interrupted worker process; the copies are interleaved in the raw log)
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 700, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 828, in _invoke_run
    time.sleep(monitor_interval)
KeyboardInterrupt