DDP Socket Timeout when nodes wait for other nodes still pulling the Docker image on a K8s cluster

Hi,

I’m trying to train a model with DDP on a Kubernetes GPU cluster where Docker images are stored ahead of training. When a training job is created, each GPU node pulls the image and sets up its own environment. The image is fairly large, so the pull takes a while, and some nodes regularly finish pulling earlier than others. These faster nodes then sit waiting for the remaining nodes to initialize, and after a while they fail with a RuntimeError: Socket Timeout because they cannot communicate with the nodes that are still pulling the image. The DDP backend is NCCL.
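For context, the distributed setup in the training script is the standard torchrun + init_process_group pattern, roughly like this (a simplified sketch, not the exact script):

```python
import torch.distributed as dist

# Each node runs this under torchrun; RANK, WORLD_SIZE, MASTER_ADDR and
# MASTER_PORT come from the launcher / Kubernetes job spec.
# Sketch assumption: no explicit timeout is passed, so the default
# rendezvous and process-group timeouts apply.
dist.init_process_group(backend="nccl")
```

Here is the traceback from one of the nodes that finished pulling early and timed out while waiting: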

Traceback (most recent call last):
  File "/conda/envs/animatediff/lib/python3.10/runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/conda/envs/animatediff/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/run.py", line 798, in <module>
    main()
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/run.py", line 794, in main
    run(args)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/run.py", line 785, in run
    elastic_launch(
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 241, in launch_agent
    result = agent.run()
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 723, in run
    result = self._invoke_run(role)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 858, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 692, in _initialize_workers
    self._rendezvous(worker_group)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 549, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 624, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 661, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/conda/envs/animatediff/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout

Is there a way to increase this timeout so that the nodes that are already running can wait longer for the remaining nodes to come up?
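For example, one knob I’m aware of is the timeout argument of init_process_group, along these lines (just a sketch; I’m not sure it even covers the failure above, since the Socket Timeout is raised by the elastic agent during rendezvous, before the process group exists):

```python
from datetime import timedelta

import torch.distributed as dist

# Give the process group a much longer timeout so initialization and
# collectives tolerate slow-starting nodes. As far as I understand, this
# does not change the elastic agent's rendezvous/store timeout, which is
# where the traceback above points.
dist.init_process_group(backend="nccl", timeout=timedelta(hours=1))
```

I’ve also seen that torchrun accepts a --rdzv_conf option (e.g. join_timeout with the c10d rendezvous backend), but I’m not sure which of these settings controls the store.get call that actually times out. Any pointers would be appreciated.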