I am trying to train a model with PyTorch 1.8.1, but my training is interrupted every 30 minutes and I have to restart it from the last checkpoint each time. The worker processes fail with:
```
...
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 525, in init_process_group
    _store_based_barrier(rank, store, timeout)
  File "/usr/local/lib/python3.8/dist-packages/torch/distributed/distributed_c10d.py", line 212, in _store_based_barrier
    raise RuntimeError(
RuntimeError: Timed out initializing process group in store based barrier on rank: 6, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)
...
RuntimeError: Timed out initializing process group in store based barrier on rank: 7, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)
...
RuntimeError: Timed out initializing process group in store based barrier on rank: 5, for key: store_based_barrier_key:1 (world_size=1, worker_count=8, timeout=0:30:00)
```

(The failing ranks print their tracebacks concurrently, so the raw output is interleaved; each rank raises the same error from the same call stack.)
This seems to be related to the `timeout` parameter of `init_process_group`, which defaults to `default_pg_timeout` (1800 seconds, i.e. exactly 30 minutes), as the source code of `torch.distributed.distributed_c10d` suggests. So I modified my code to pass a larger timeout:
```python
from datetime import timedelta

# default_pg_timeout is timedelta(seconds=1800)
timeout = timedelta(seconds=86400)
...
dist.init_process_group(
    backend=backend,
    init_method='tcp://127.0.0.1:%d' % tcp_port,
    rank=local_rank,
    world_size=num_gpus,
    timeout=timeout,
)
```

But it still seems to time out. I also notice that the error message reports `world_size=1` while `worker_count=8`, which looks inconsistent to me. Could someone help me, please?
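For reference, here is a minimal standalone version of the initialization I am attempting (a sketch with a hypothetical port number; I use the `gloo` backend and `world_size=1` here only so it runs on a single machine without GPUs, whereas the real run starts one process per GPU):

```python
from datetime import timedelta

import torch.distributed as dist


def init_group(rank: int, world_size: int, tcp_port: int = 29611) -> None:
    """Initialize the default process group with an enlarged timeout.

    Every spawned process must pass the SAME world_size (the total
    number of workers) and a distinct rank in [0, world_size).
    """
    dist.init_process_group(
        backend="gloo",                    # "nccl" for multi-GPU training
        init_method="tcp://127.0.0.1:%d" % tcp_port,
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=86400),  # instead of the 1800 s default
    )


if __name__ == "__main__":
    # Single-worker group: the store-based barrier completes immediately.
    init_group(rank=0, world_size=1)
    print(dist.is_initialized())           # True
    dist.destroy_process_group()
```

In the real training script, `rank` and `world_size` would come from the launcher (e.g. one spawned process per GPU), not be hard-coded.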