Hi,
I’m training a model on a Kubernetes GPU cluster where Docker images are stored before training. When a training job is created, each GPU node pulls the image and sets up its own environment. The image is fairly large, so the pull takes a while and the nodes don’t finish at the same time. The nodes that finish pulling first start up and wait for the remaining nodes to initialize, but after a while they fail with a Socket Timeout runtime error because the slower nodes are still pulling the image and can’t be reached. I’m using PyTorch DDP with the NCCL backend.
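For context, each node launches the job roughly like this (node count, endpoint, and script name below are placeholders, and the c10d rendezvous backend is an assumption; the real command differs in those details):

```bash
# Sketch of the per-node launch, not the exact command.
python -m torch.distributed.run \
    --nnodes 4 \
    --nproc_per_node 8 \
    --rdzv_backend c10d \
    --rdzv_id animatediff-job \
    --rdzv_endpoint master-node:29500 \
    train.py --config configs/training.yaml
```

The nodes that are already up fail with the following traceback: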
```
Traceback (most recent call last):
  File "/conda/envs/animatediff/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/conda/envs/animatediff/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 549, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 624, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 661, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
```
Is there a way to increase this timeout so that the nodes that are already up can wait longer for the remaining nodes to come online?
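For example, would raising the rendezvous timeouts on the launch command be the right knob? Something like the sketch below, assuming the c10d rendezvous backend and a PyTorch version where `--rdzv_conf` accepts `join_timeout` and `read_timeout` (I have not verified these keys against my exact version):

```bash
# Same launch as above, but with longer rendezvous timeouts (values in seconds,
# chosen arbitrarily; the config keys are an assumption about the c10d backend).
python -m torch.distributed.run \
    --nnodes 4 --nproc_per_node 8 \
    --rdzv_backend c10d \
    --rdzv_id animatediff-job \
    --rdzv_endpoint master-node:29500 \
    --rdzv_conf join_timeout=1800,read_timeout=1800 \
    train.py --config configs/training.yaml
```

Or is the Socket Timeout here controlled by a different setting (for example, the store timeout used inside the elastic agent), and if so, where would I change it?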