PyTorch hangs after an error during DDP training

Hi, it is strange that after upgrading torch from 1.4 to 1.9, DDP training hangs at dist.barrier() instead of being killed when an error happens.
Below is a sample of the code:

model_prepare()
dist.barrier()
train_epoch()
dist.barrier()
validate()
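
For context, here is a minimal self-contained sketch of the same structure (the model and the training/validation bodies are placeholders of my own, not the real code; it assumes the launcher exposes LOCAL_RANK in the environment, otherwise parse --local_rank instead):

import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # the launcher is expected to set LOCAL_RANK / RANK / WORLD_SIZE / MASTER_* env vars
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)

    # model_prepare(): placeholder model wrapped in DDP
    model = DDP(torch.nn.Linear(1024, 1024).cuda(local_rank), device_ids=[local_rank])
    opt = torch.optim.SGD(model.parameters(), lr=0.01)

    dist.barrier()

    # train_epoch(): placeholder single step; this is where the OOM is raised
    x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
    loss = model(x).sum()
    loss.backward()
    opt.step()

    dist.barrier()

    # validate(): placeholder
    with torch.no_grad():
        model(x)

if __name__ == "__main__":
    main()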

An OOM error occurs during training. However, the DDP processes hang as shown below rather than just stopping and being killed:

RuntimeError: CUDA out of memory. Tried to allocate 330.00 MiB (GPU 0; 10.92 GiB total capacity; 8.75 GiB already allocated; 146.38 MiB free; 9.01 GiB reserved in total by PyTorch)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11607) of binary: /xxx/miniconda3/envs/torch190cu111/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/1 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/1/error.json
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 1
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 0
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
...

I start DDP by running this bash command:

CUDA_VISIBLE_DEVICES="0,1" python3 -m torch.distributed.launch --nproc_per_node 2 train.py <args>

How can I deal with this problem? I want the training process to just be killed after hitting an error like OOM, rather than hanging forever.

Thanks for posting @sunshichen. Could it be the case that only some processes hit OOM while the others don't, and those simply hang because they're waiting at dist.barrier()? Or are you observing that all processes hit OOM and all of them hang?

It would also be good if you could share a small, self-contained script that reproduces the issue so that we can help you debug it.

I have the same question. Have you fixed it?

@supermilkflower Can you try running with TORCH_DISTRIBUTED_DEBUG=DETAIL and TORCH_SHOW_CPP_STACKTRACES=1?

If just one process OOMs while the others make it to the barrier, you'll likely see a hang, as the non-OOM processes have no way to monitor/supervise the faulty ones.
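
With the launch command from the original post, setting those variables would look something like this (only the environment variables are added; the rest of the command is unchanged):

CUDA_VISIBLE_DEVICES="0,1" TORCH_DISTRIBUTED_DEBUG=DETAIL TORCH_SHOW_CPP_STACKTRACES=1 python3 -m torch.distributed.launch --nproc_per_node 2 train.py <args>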

I am also having this issue whenever I use a large dataset. I'm training a NeMo model.

15247it [58:54,  4.53it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
14572it [58:55,  3.74it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
15334it [59:14,  4.30it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
13797it [59:18,  3.78it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
13836it [59:28,  3.86it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
15461it [59:45,  4.33it/s]Waiting in store based barrier to initialize process group for rank: 0, key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00)
13911it [59:47,  3.76it/s][NeMo W 2022-06-22 12:50:01 nemo_logging:349] /opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py:420: UserWarning: Error handling mechanism for deadlock detection is uninitialized. Skipping check.
      rank_zero_warn("Error handling mechanism for deadlock detection is uninitialized. Skipping check.")

Error executing job with overrides: ['train_dataset=6670_manifest_other_train.json', 'validation_datasets=6670_manifest_other_dev.json']
15213it [59:40,  4.36it/s]Traceback (most recent call last):
  File "fastpitch_align.py", line 31, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 719, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1169, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 151, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 610, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 246, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 0, for key: store_based_barrier_key:1 (world_size=7, worker_count=1, timeout=0:30:00
)

@m.aqeeljanjua Thanks for your comment! Could you please file a GH issue with a reproduction of your training script that hangs for your use case?

I will, but this error only happens when I use a large dataset.

@rvarm1 This happened while the dataloader was running.

`Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
26469it [1:42:35,  4.30it/s]ot@7e179f0ee261:/workspace/share/docker2Training/SpeechData/LibriVox_TTS_dataset/hi_fi_tts_v0#
50it [00:12,  4.16it/s]it/s]
26250it [1:42:52,  4.10it/s]Initializing distributed: GLOBAL_RANK: 2, MEMBER: 3/6
26469it [1:43:45,  4.25it/s]
50it [00:11,  4.19it/s]it/s]
25567it [1:43:43,  3.70it/s]Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/6
26469it [1:47:40,  4.10it/s]
26469it [1:47:49,  4.09it/s]
50it [00:11,  4.18it/s]it/s]
38it [00:09,  3.68it/s]Initializing distributed: GLOBAL_RANK: 5, MEMBER: 6/6
50it [00:13,  3.72it/s]it/s]
25224it [1:48:01,  3.62it/s]Initializing distributed: GLOBAL_RANK: 3, MEMBER: 4/6
26469it [1:53:41,  3.88it/s]
50it [00:13,  3.83it/s]
Initializing distributed: GLOBAL_RANK: 4, MEMBER: 5/6
Error executing job with overrides: ['train_dataset=9017_manifest_clean_train.json', 'validation_datasets=9017_manifest_clean_dev.json']
Traceback (most recent call last):
  File "/workspace/share/docker2Training/SpeechData/LibriVox_TTS_dataset/hi_fi_tts_v0/fastpitch_align.py", line 31, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1169, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 151, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 578, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
Set the environment variable HYDRA_FULL_ERROR=1 for a complete stack trace.
Error executing job with overrides: ['train_dataset=9017_manifest_clean_train.json', 'validation_datasets=9017_manifest_clean_dev.json']
Traceback (most recent call last):
  File "/workspace/share/docker2Training/SpeechData/LibriVox_TTS_dataset/hi_fi_tts_v0/fastpitch_align.py", line 31, in main
    trainer.fit(model)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 768, in fit
    self._call_and_handle_interrupt(
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 721, in _call_and_handle_interrupt
    return trainer_fn(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 809, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1169, in _run
    self.strategy.setup_environment()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 151, in setup_environment
    self.setup_distributed()
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 191, in setup_distributed
    init_dist_connection(self.cluster_environment, self._process_group_backend)
  File "/opt/conda/lib/python3.8/site-packages/pytorch_lightning/utilities/distributed.py", line 355, in init_dist_connection
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 578, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 229, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 157, in _create_c10d_store
    return TCPStore(
RuntimeError: connect() timed out. Original timeout was 1800000 ms.
`

Based on your logs (I can't completely tell if it is the full log, though), it seems that rank 0 did not log "Initializing distributed", and there may be an issue where the other ranks are waiting for rank 0 to connect, but it never did. Could you check the rank 0 logs to see if that process is active or if it crashed earlier?

Something similar has been happening to me lately. I start training my model and leave it, and after some time I notice that nvidia-smi no longer works and shows a strange error: "Unable to determine the device handle for GPU 0000:67:00.0: Unknown Error". Do you think the GPU is getting lost, or something related to that is happening, and that this is what produces the error below?

RuntimeError: NCCL communicator was aborted on rank 0.
[E ProcessGroupNCCL.cpp:325] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data. To avoid this inconsistency, we are taking the entire process down.
terminate called after throwing an instance of 'std::runtime_error'
what(): [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(OpType=ALLREDUCE, Timeout(ms)=4800000) ran for 4808523 milliseconds before timing out.


Hello, I also have the same problem. Have you resolved it?

This looks like a hardware-related problem, as the card stopped working.

I'd look at the NVIDIA driver logs to see if there's anything there suggesting the reason.

tl;dr: Just call init_process_group at the beginning of your code so that dist.is_initialized() is True and no other open-source library has to call init_process_group itself.

The problem for me was that my code called init_process_group and then destroy_process_group. When init_process_group was called again, the code hung.

That is because init_process_group counts the GPU workers that have reached that point in the code and compares the count with the GPU world size (code link).
Inside init_process_group there is a while loop that breaks when the GPU worker count equals the GPU world size.

The first time init_process_group is called, it counts the number of workers correctly (for me it was 8), which matches the GPU world size, so it breaks out of the while loop. When destroy_process_group is then called, the worker count is not reset back to 0. So when init_process_group is called again, the GPU worker count ends up reaching 16, which is not equal to the world size. The code therefore just hangs in the while loop, waiting for the worker count to equal the GPU world size, which never happens.
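
To make the failure mode concrete, here is a minimal sketch of that pattern (illustrative only, not the library code; the comments paraphrase the explanation above and assume the usual env:// launcher setup):

import torch.distributed as dist

# launched on N GPUs (e.g. via torchrun / torch.distributed.launch)
dist.init_process_group(backend="nccl")   # worker count in the store reaches N == world_size, the barrier passes
dist.destroy_process_group()              # the worker count kept for the barrier key is NOT reset to 0

# Second init against the same store: the count climbs from N towards 2N,
# so the store-based barrier never sees count == world_size and hangs here.
dist.init_process_group(backend="nccl")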

A concrete example:
I am using the MosaicML Streaming library, which calls init_process_group internally, but only if init_process_group has not been called by anyone else. It then calls destroy_process_group.
(See link1 in the next reply.)

Then, when I use MMEngine for training, it also calls init_process_group (via init_dist). And that call hangs for the reason I explained above.
(See link2 in the next reply.)

So to overcome the issue, in my code I make sure I call init_process_group myself at the beginning, so that MosaicML doesn't have to call it and destroy it; see the sketch below.
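
A minimal sketch of that workaround (my own illustration; the backend and env:// rendezvous are assumptions, adapt them to your setup):

import torch.distributed as dist

def setup_distributed_once(backend: str = "nccl") -> None:
    # Call this at program start, before any library (MosaicML Streaming,
    # MMEngine, ...) gets the chance to call init_process_group itself.
    if dist.is_available() and not dist.is_initialized():
        dist.init_process_group(backend=backend, init_method="env://")

setup_distributed_once()
# ... rest of the training code; nothing else calls init/destroy_process_group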

The proper solution here would be to reset the worker count in the destroy_process_group function (code link).

I will try to send out a PR to fix it for good, but the workaround I mentioned should work until then. Hope this helps!


[In continuation of the previous reply]

link1

link2