PyTorch hangs after an error during DDP training

Hi, it is strange that after upgrading torch from 1.4 to 1.9, DDP training hangs at dist.barrier() instead of being killed when an error happens.
Below is a sample of the code:

model_prepare()
dist.barrier()
train_epoch()
dist.barrier()
validate()
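One possible workaround (my own sketch, not from this thread) is to catch the error on each rank and exit the process with a non-zero code, so the launcher can tear down the worker group instead of leaving the other ranks stuck at dist.barrier(). The helper name `run_step_safely` is hypothetical:

```python
import sys
import traceback

def run_step_safely(step_fn, rank):
    """Run one training phase; return True on success.

    On any RuntimeError (e.g. CUDA OOM), print the traceback and exit
    this process with a non-zero code so the elastic launcher detects
    the failure and stops the remaining workers, rather than letting
    them wait forever at the next dist.barrier().
    """
    try:
        step_fn()
        return True
    except RuntimeError:
        print(f"[rank {rank}] step failed, exiting", file=sys.stderr)
        traceback.print_exc(file=sys.stderr)
        sys.exit(1)  # non-zero exit signals the worker group to stop
```

Usage would be e.g. `run_step_safely(train_epoch, rank)` in place of the bare `train_epoch()` call above.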

An OOM error occurs during training. However, the DDP process hangs as shown below instead of just stopping and being killed:

RuntimeError: CUDA out of memory. Tried to allocate 330.00 MiB (GPU 0; 10.92 GiB total capacity; 8.75 GiB already allocated; 146.38 MiB free; 9.01 GiB reserved in total by PyTorch)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11607) of binary: /xxx/miniconda3/envs/torch190cu111/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/1 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/1/error.json
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 1
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 0
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
...

I start DDP with this shell command:

CUDA_VISIBLE_DEVICES="0,1" python3 -m torch.distributed.launch --nproc_per_node 2 train.py <args>

How can I deal with this problem? I want the training process to be killed after an error such as OOM rather than hanging forever.

Thanks for posting @sunshichen. Could it be that only some processes hit the OOM while the others don't, and those hang because they are waiting at dist.barrier()? Or are you observing that all processes hit the OOM and all hang?

It would also help if you could provide a small self-contained script that reproduces the issue, so that we can help you debug it.

I have the same question. Have you fixed it?

@supermilkflower can you try running with TORCH_DISTRIBUTED_DEBUG=DETAIL and TORCH_SHOW_CPP_STACKTRACES=1 set?

If just one process OOMs while the others reach the barrier, you'll likely see a hang, as the non-OOM processes have no way to monitor or supervise the faulty one.
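One pattern that helps surface such per-rank failures (assuming torch >= 1.9's elastic error propagation, which the logs above suggest is in use) is wrapping the training entrypoint with the `record` decorator, so the failing rank's exception is written to the agent's reply file and reported by the launcher:

```python
# Sketch: propagate per-rank errors to the elastic agent's reply file
# so the launcher can report which rank failed and with what error,
# instead of the failure surfacing only as a hang on other ranks.
# Assumes torch >= 1.9; the body of `main` is a placeholder.
from torch.distributed.elastic.multiprocessing.errors import record

@record
def main():
    ...  # model_prepare(), dist.barrier(), train_epoch(), etc.

if __name__ == "__main__":
    main()
```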

I am also having this issue whenever I use a large dataset. I'm training a NeMo model.

15247it [58:54,  4.53it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
14572it [58:55,  3.74it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
15334it [59:14,  4.30it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
13797it [59:18,  3.78it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
13836it [59:28,  3.86it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
15461it [59:45,  4.33it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
13911it [59:47,  3.76it/s][NeMo W 2022-06-22 12:50:01 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py:420: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")

Error executing job with overrides: ['train_dataset=6670_manifest_other_train.json', 'validation_datasets=6670_manifest_other_dev.json']
15213it [59:40,  4.36it/s]Traceback (most recent call last):
  File "fastpitch_align.py", line 31, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1169, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 151, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 610, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 246, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00
)

@m.aqeeljanjua Thanks for your comment! Could you please file a GH issue with a reproduction of your training script that hangs for your use case?

I will, but this error only happens when I use a large dataset.

@rvarm1 This happened while the dataloader was running.

Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
26469it [1:42:35,  4.30it/s]ot@7e179f0ee261:/workspace/share/docker2Training/SpeechData/LibriVox_TTS_dataset/hi_fi_tts_v0#
50it [00:12,  4.16it/s]it/s]
26250it [1:42:52,  4.10it/s]Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/6
26469it [1:43:45,  4.25it/s]
50it [00:11,  4.19it/s]it/s]
25567it [1:43:43,  3.70it/s]Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/6
26469it [1:47:40,  4.10it/s]
26469it [1:47:49,  4.09it/s]
50it [00:11,  4.18it/s]it/s]
38it [00:09,  3.68it/s]Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/6
50it [00:13,  3.72it/s]it/s]
25224it [1:48:01,  3.62it/s]Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/6
26469it [1:53:41,  3.88it/s]
50it [00:13,  3.83it/s]
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/6
Error executing job with overrides: ['train_dataset=9017_manifest_clean_train.json', 'validation_datasets=9017_manifest_clean_dev.json']
Traceback (most recent call last):
  File "/workspace/share/docker2Training/SpeechData/LibriVox_TTS_dataset/hi_fi_tts_v0/fastpitch_align.py", line 31, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1169, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 151, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 578, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['train_dataset=9017_manifest_clean_train.json', 'validation_datasets=9017_manifest_clean_dev.json']
Traceback (most recent call last):
  File "/workspace/share/docker2Training/SpeechData/LibriVox_TTS_dataset/hi_fi_tts_v0/fastpitch_align.py", line 31, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1169, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 151, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 578, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: connect() timed out. Original timeout was 1800000 ms.

Based on your logs (I can't tell whether this is the full log), it seems rank 0 never logged "Initializing distributed", and the other ranks may be waiting for rank 0 to connect, which it never did. Could you check the rank 0 logs to see whether that process is still active or crashed earlier?

A similar thing has been happening to me lately. I leave my model training, and after some time I notice that nvidia-smi stops working and shows a weird error: Unable to determine the device handle for GPU 0000:67:00.0: Unknown Error. Do you think my GPU is getting lost or something related is happening, and that this is what causes the error below?

RuntimeError: NCCL communicator was aborted on rank 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=4800000) ran for 4808523 milliseconds before timing out.
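For what it's worth (a suggestion of mine, not from this thread): with the NCCL backend you can opt into asynchronous error handling, so that a failed or timed-out collective like the ALLREDUCE above aborts the process instead of leaving other ranks hung. The variable must be set before the process group is initialized:

```python
import os

# Opt into NCCL async error handling (available in torch >= 1.7):
# the NCCL watchdog then tears the process down when a collective
# fails or times out, which matches the "just kill it" behavior the
# original poster wants.  Must be set before init_process_group().
os.environ["NCCL_ASYNC_ERROR_HANDLING"] = "1"
```

Equivalently, export it in the shell before launching the training script.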