Training with 4 GPUs in DDP

My server has 4 A4000 GPUs. I am currently training a model with DDP, but the following error occurs halfway through every training run. How can I solve it? (A simplified sketch of my DDP setup is shown below, followed by the full log.)
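For reference, the traceback indicates the script is started with torch.distributed.launch and initializes the NCCL process group inside main(). The snippet below is only a minimal sketch of that standard pattern, not the actual scripts/train_radiant_pgd.py code (the model here is a placeholder):

# Launched roughly as:
#   python -m torch.distributed.launch --nproc_per_node=4 scripts/train_radiant_pgd.py
import argparse

import torch
import torch.nn as nn
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main(args):
    # Pin this process to its GPU before creating the process group
    torch.cuda.set_device(args.local_rank)
    # The call visible at line 351 of the traceback
    dist.init_process_group(backend='nccl')

    # Placeholder model; the real network is far larger
    model = nn.Linear(16, 4).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])
    # ... training loop with a DistributedSampler-backed DataLoader ...

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every worker process
    parser.add_argument('--local_rank', type=int, default=0)
    main(parser.parse_args())

The full error log: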

Train Epoch: 4 [0/141886 (0%)]  Loss: 0.373296
grad_norm: 6.0441 iteration: 53208
loss_radarClass:0.170  loss_radarOffset:0.133  loss_radarDepthOffset:0.071  loss:0.373  
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: -9) local_rank: 2 (pid: 212680) of binary: /opt/conda/bin/python
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 3/3 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1, 2, 3]
  role_ranks=[0, 1, 2, 3]
  global_ranks=[0, 1, 2, 3]
  role_world_sizes=[4, 4, 4, 4]
  global_world_sizes=[4, 4, 4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/2/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker3 reply file to: /tmp/torchelastic_47alj7h1/none_c5ab1jtk/attempt_1/3/error.json
^CTraceback (most recent call last):
  File "scripts/train_radiant_pgd.py", line 564, in <module>
Traceback (most recent call last):
Traceback (most recent call last):
  File "scripts/train_radiant_pgd.py", line 564, in <module>
  File "scripts/train_radiant_pgd.py", line 564, in <module>
    main(args)
  File "scripts/train_radiant_pgd.py", line 351, in main
    main(args)
  File "scripts/train_radiant_pgd.py", line 351, in main
    main(args)
  File "scripts/train_radiant_pgd.py", line 351, in main
    dist.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    dist.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    dist.init_process_group(backend='nccl')
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 547, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 207, in _store_based_barrier
    _store_based_barrier(rank, store, timeout)
      File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 207, in _store_based_barrier
_store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 207, in _store_based_barrier
    time.sleep(0.01)
KeyboardInterrupt        
time.sleep(0.01)time.sleep(0.01)

KeyboardInterruptKeyboardInterrupt

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/opt/conda/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 173, in <module>
    main()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launch.py", line 169, in main
    run(args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/run.py", line 624, in run
    )(*cmd_args)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    result = agent.run()
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/metrics/api.py", line 125, in wrapper
    result = f(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 700, in run
    result = self._invoke_run(role)
  File "/opt/conda/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 828, in _invoke_run
    time.sleep(monitor_interval)
KeyboardInterrupt

The exitcode: -9 indicates the Python process was killed via SIGKILL, which is often done by the OS (the OOM killer) when the host runs out of memory. Check if that's the case and reduce the memory usage if needed.
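One way to confirm the OOM hypothesis is to watch host RAM during training and to check the kernel log after a crash (dmesg usually contains an "Out of memory: Killed process <pid>" line). A minimal sketch, assuming psutil is installed (pip install psutil); the helper name and the logging interval are just examples:

import psutil

def log_host_memory(prefix=''):
    # Print current host (CPU) RAM usage; call this periodically in the training loop
    vm = psutil.virtual_memory()
    print(f'{prefix} host RAM: {vm.used / 1e9:.1f} / {vm.total / 1e9:.1f} GB ({vm.percent:.0f}%)')

# Example: inside the training loop
# if iteration % 100 == 0:
#     log_host_memory(f'iter {iteration}')

If host RAM is indeed the problem, the usual mitigations are lowering num_workers on each DataLoader (4 DDP processes each spawn their own worker processes) and/or reducing the batch size.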

Try checking whether the processes on the GPUs have actually been killed by examining the memory load and PIDs with watch -n 1 nvidia-smi. Often one of the processes is not terminated when an error occurs, so if GPU memory is still allocated after you stopped training (or after an automatic error termination), kill the stale process with kill -9 PID. The PID running on each GPU is listed in the table at the bottom of the nvidia-smi output.
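If you prefer a script over watching nvidia-smi by hand, something like the sketch below lists the compute processes still holding GPU memory so stale ones can be killed with kill -9 <pid>. It assumes nvidia-smi is on the PATH; the exact query field names can vary between driver versions (see nvidia-smi --help-query-compute-apps):

import subprocess

def list_gpu_processes():
    # Ask nvidia-smi for the compute processes currently using the GPUs
    out = subprocess.run(
        ['nvidia-smi',
         '--query-compute-apps=pid,process_name,used_memory',
         '--format=csv,noheader'],
        capture_output=True, text=True, check=True,
    ).stdout.strip()
    if not out:
        print('No compute processes on the GPUs.')
        return
    for line in out.splitlines():
        print(line)  # e.g. "212680, /opt/conda/bin/python, 10241 MiB"

if __name__ == '__main__':
    list_gpu_processes()

A stale rank that keeps its GPU memory (or the rendezvous port) can also explain why the automatic restart above gets stuck waiting in _store_based_barrier.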