Update timeout for PyTorch Lightning DDP

I am trying to update the default distributed process group timeout from 30 minutes to 3 hours using:

```python
ce = pl.plugins.environments.lightning_environment.LightningEnvironment()
pl.utilities.distributed.init_dist_connection(cluster_environment=ce, torch_distributed_backend='nccl', timeout=dt.timedelta(seconds=36060))
```

i.e. passing timeout as a kwarg that gets forwarded to torch.distributed.init_process_group, but I am getting the error below:
```
    pl.utilities.distributed.init_dist_connection(cluster_environment=ce, torch_distributed_backend='nccl', timeout=dt.timedelta(seconds=36060))
  File "/mnt/task_runtime/py36/lib/python3.6/site-packages/pytorch_lightning/utilities/distributed.py", line 388, in init_dist_connection
    torch_distributed_backend, rank=global_rank, world_size=world_size, **kwargs
  File "/mnt/task_runtime/py36/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 500, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/mnt/task_runtime/py36/lib/python3.6/site-packages/torch/distributed/rendezvous.py", line 190, in _env_rendezvous_handler
    store = TCPStore(master_addr, master_port, world_size, start_daemon, timeout)
RuntimeError: Address already in use
```
How do I set the timeout?

This error is raised when the network address is already in use by another process; it is unrelated to the timeout value, which looks correct.
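One quick thing to check: the env:// rendezvous shown in your traceback builds its TCPStore from the MASTER_ADDR / MASTER_PORT environment variables, so if the default port is taken by another (possibly stale) process, pointing it at a free port should make the error go away. A minimal sketch (29501 is just an example port, not something your setup requires):

```python
import os

# The rendezvous handler in the traceback reads MASTER_PORT from the
# environment. If the current port is occupied, switch to one that is
# known to be free; it must be the same on every rank of the job.
os.environ["MASTER_PORT"] = "29501"
```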
By the way, you can also use timedelta(hours=3). Three hours does sound quite excessive, though; would you mind explaining why you are expecting such long timeouts in your training?
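For reference, your call with that more readable form, assuming the same Lightning version where init_dist_connection forwards the timeout kwarg to torch.distributed.init_process_group:

```python
from datetime import timedelta

import pytorch_lightning as pl

ce = pl.plugins.environments.lightning_environment.LightningEnvironment()
pl.utilities.distributed.init_dist_connection(
    cluster_environment=ce,
    torch_distributed_backend="nccl",
    timeout=timedelta(hours=3),  # clearer than counting seconds
)
```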
