PyTorch hangs after an error during DDP training

Hi, it is strange that after upgrading torch from 1.4 to 1.9, DDP training hangs at dist.barrier() instead of being killed when an error happens.
Below is a sample of the code:

model_prepare()
dist.barrier()
train_epoch()
dist.barrier()
validate()
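
Concretely, the script is structured roughly like the sketch below. train_epoch() and validate() stand in for my real training and validation code, and the tiny Linear model is just a placeholder:

import argparse

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def train_epoch(model):
    # Placeholder for the real training loop; the OOM is raised somewhere in here.
    device = next(model.parameters()).device
    for _ in range(10):
        model(torch.randn(32, 10, device=device)).sum().backward()


def validate(model):
    # Placeholder for the real validation loop.
    device = next(model.parameters()).device
    with torch.no_grad():
        model(torch.randn(32, 10, device=device))


def main():
    # torch.distributed.launch passes --local_rank to every worker it spawns.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    args = parser.parse_args()

    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl", init_method="env://")

    # "model_prepare()": build the model and wrap it in DDP.
    model = DDP(torch.nn.Linear(10, 10).cuda(args.local_rank),
                device_ids=[args.local_rank])
    dist.barrier()

    train_epoch(model)
    dist.barrier()  # a surviving rank waits here if another rank has already died

    validate(model)


if __name__ == "__main__":
    main()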

An OOM error occurred during training. However, the DDP process hangs as shown below instead of just stopping and being killed:

RuntimeError: CUDA out of memory. Tried to allocate 330.00 MiB (GPU 0; 10.92 GiB total capacity; 8.75 GiB already allocated; 146.38 MiB free; 9.01 GiB reserved in total by PyTorch)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11607) of binary: /xxx/miniconda3/envs/torch190cu111/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/1 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=1
  master_addr=127.0.0.1
  master_port=29500
  group_rank=0
  group_world_size=1
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[2, 2]
  global_world_sizes=[2, 2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/1/error.json
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 1
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 0
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
...

I start DDP with the following bash command:

CUDA_VISIBLE_DEVICES="0,1" python3 -m torch.distributed.launch --nproc_per_node 2 train.py <args>

How can I deal with this problem? I want the training process to be killed after an error like OOM occurs, rather than hanging forever.

Thanks for posting @sunshichen. Could it be that only some processes hit the OOM while the others do not, and the surviving ones hang because they are waiting at dist.barrier()? Or are you observing that all processes OOM and all of them hang?

It would also help if you could share a small self-contained script that reproduces the issue so that we can help you debug it.
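
For reference, the hang pattern described above (one rank dies before a collective, the other waits at the barrier) can be demonstrated with a toy script along these lines; the simulated error just stands in for a real OOM:

# toy_hang.py -- run with:
#   python3 -m torch.distributed.launch --nproc_per_node 2 toy_hang.py
import argparse

import torch.distributed as dist


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)
    parser.parse_args()

    # gloo backend so the toy runs without GPUs
    dist.init_process_group(backend="gloo", init_method="env://")

    if dist.get_rank() == 0:
        # Simulate the failure: rank 0 dies before reaching the barrier.
        raise RuntimeError("simulated CUDA out of memory")

    # Rank 1 never sees the error and blocks here until the backend's
    # timeout expires (30 minutes by default), if it fires at all.
    dist.barrier()


if __name__ == "__main__":
    main()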