Hi, it is strange that after upgrading torch from 1.4 to 1.9, DDP training hangs at dist.barrier() instead of being killed when an error happens.
Below is a simplified sample of the code (model_prepare, train_epoch, and validate are placeholders for my own functions):
import torch.distributed as dist

model_prepare()   # build the model and wrap it with DistributedDataParallel
dist.barrier()    # sync all ranks before training
train_epoch()     # the OOM is raised inside here on one rank
dist.barrier()    # <-- the surviving ranks hang here
validate()
An OOM error occurred during training. However, instead of simply stopping and being killed, the DDP process hangs as shown below:
RuntimeError: CUDA out of memory. Tried to allocate 330.00 MiB (GPU 0; 10.92 GiB total capacity; 8.75 GiB already allocated; 146.38 MiB free; 9.01 GiB reserved in total by PyTorch)
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 11607) of binary: /xxx/miniconda3/envs/torch190cu111/bin/python3
ERROR:torch.distributed.elastic.agent.server.local_elastic_agent:[default] Worker group failed
INFO:torch.distributed.elastic.agent.server.api:[default] Worker group FAILED. 1/1 attempts left; will restart worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Stopping worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=1
master_addr=127.0.0.1
master_port=29500
group_rank=0 group_world_size=1 local_ranks=[0, 1]
role_ranks=[0, 1] global_ranks=[0, 1] role_world_sizes=[2, 2]
global_world_sizes=[2, 2]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_dl0c_xte/none_rwikf9e7/attempt_1/1/error.json
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 1
[2021-07-21 17:36:32] INFO (torch.distributed.distributed_c10d/MainThread) Added key: store_based_barrier_key:1 to store for rank: 0
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:42] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:36:52] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:02] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 1, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
[2021-07-21 17:37:12] INFO (torch.distributed.distributed_c10d/MainThread) Waiting in store based barrier to initialize
process group for rank: 0, key: store_based_barrier_key:1 (world_size=2, worker_count=4, timeout=0:30:00)
...
Note that the log shows worker_count=4 even though world_size=2. It looks like the restarted workers are also counting the store keys left behind by the killed workers, so the store-based barrier can never complete and just waits out the 30-minute timeout.
I start DDP with the following bash command:
CUDA_VISIBLE_DEVICES="0,1" python3 -m torch.distributed.launch --nproc_per_node 2 train.py <args>
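Since torch.distributed.launch in 1.9 delegates to the elastic agent (which is what restarts the worker group in the log above), I guess one option is to disable the automatic restart. A sketch of what I mean, assuming --max_restarts of torch.distributed.run is the right knob here (I have not verified this fixes the hang):
CUDA_VISIBLE_DEVICES="0,1" python3 -m torch.distributed.run --nproc_per_node 2 --max_restarts 0 train.py <args>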
How can I deal with this problem? I want the training process to simply be killed after hitting an error like OOM, rather than hanging forever.
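For now, the only workaround I can think of is to catch the exception myself and hard-exit the failing rank so it never reaches the next barrier. A minimal sketch, reusing the placeholder functions from above (the traceback printing and os._exit are my own additions, not anything official):
import os
import traceback

import torch.distributed as dist

# model_prepare / train_epoch / validate are the placeholders from above;
# init_process_group is assumed to have been called already.
model_prepare()
dist.barrier()
try:
    train_epoch()
except RuntimeError:
    # Print the error (e.g. CUDA OOM) so it is not lost, then
    # hard-exit this rank immediately so it never reaches the
    # next barrier; the agent should see the non-zero exit code
    # and tear down the remaining ranks.
    traceback.print_exc()
    os._exit(1)
dist.barrier()
validate()
Is there a cleaner way to get this behavior without wrapping every stage in try/except?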