Socket Timeout for distributed training

Hello,

We are running distributed training on 32 nodes, with 4 GPUs available on each node. However, the job fails with RuntimeError: Socket Timeout at a specific epoch, as follows:

Accuracy of the network on the 50000 test images: 0.6%
Max accuracy: 0.56%
Epoch: [3]  [ 0/78]  eta: 0:05:17  lr: 0.006401  loss: 6.8356 (6.8356)  time: 4.0712  data: 1.9933  max mem: 12282
Epoch: [3]  [10/78]  eta: 0:00:59  lr: 0.006401  loss: 6.8580 (6.8624)  time: 0.8734  data: 0.1814  max mem: 12282
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4286 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4287 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4289 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 4288) of binary: /global/homes/z/zw241/.conda/envs/pt-1.9/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 313.7377507686615 seconds
Traceback (most recent call last):
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 904, in _exit_barrier
    barrier_timeout=self._exit_barrier_timeout,
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

We are not sure whether this is an issue with the platform or with the PyTorch library. We ran the same training process on 16 nodes and on 8 nodes, and both runs looked fine. Thanks for your help!
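For reference, the job sizes mentioned above work out as shown in the sketch below. It assumes one worker process per GPU and that global ranks are assigned contiguously per node (rank = node_index * gpus_per_node + local_rank). Neither assumption is confirmed in the post, and the helper names are made up for illustration.

```python
from typing import Tuple

GPUS_PER_NODE = 4  # from the post: each node can access 4 GPUs


def world_size(num_nodes: int, gpus_per_node: int = GPUS_PER_NODE) -> int:
    """Total number of worker processes, assuming one process per GPU."""
    return num_nodes * gpus_per_node


def locate_rank(global_rank: int, gpus_per_node: int = GPUS_PER_NODE) -> Tuple[int, int]:
    """Map a global rank to (node_index, local_rank).

    Assumes contiguous rank assignment per node, which is an assumption
    of this sketch, not something stated in the thread.
    """
    return divmod(global_rank, gpus_per_node)


if __name__ == "__main__":
    for nodes in (8, 16, 32):
        print(f"{nodes} nodes -> world size {world_size(nodes)}")
    # The failure summary later in the thread reports rank 91 (local_rank 3).
    print("rank 91 ->", locate_rank(91))
```

Under these assumptions the failing 32-node job has 128 ranks, and rank 91 maps to local rank 3, which agrees with the local_rank printed in the failure summary.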

Thanks for posting, @Zhe_Zhe. Can you tell us more about your environment setup so that we can help further? Does this only happen with NCCL, and which NCCL version are you on?

Also, for good measure, you can try setting NCCL_DEBUG=INFO (see also Distributed communication package - torch.distributed — PyTorch 1.10.1 documentation) and check whether you see anything out of the ordinary (perhaps warnings or other messages that indicate a problem with the topology).
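One way to apply this from inside the training script is sketched below. NCCL_DEBUG and NCCL_DEBUG_SUBSYS are NCCL environment variables; the helper name `enable_nccl_debug` and the choice of the `INIT,NET` subsystems are assumptions made for this example. The variables need to be set in every worker process before the process group is initialized, so exporting them in the job script before launching should work equally well.

```python
import os


def enable_nccl_debug(subsystems: str = "INIT,NET") -> dict:
    """Set NCCL debug variables for this process and return what was set.

    Call this before torch.distributed.init_process_group(...), since the
    variables must be in place before NCCL communicators are created.
    """
    settings = {
        "NCCL_DEBUG": "INFO",             # print NCCL info and warning messages
        "NCCL_DEBUG_SUBSYS": subsystems,  # limit output to these subsystems
    }
    os.environ.update(settings)
    return settings


if __name__ == "__main__":
    applied = enable_nccl_debug()
    for name, value in applied.items():
        print(f"{name}={value}")
```

Restricting the subsystems keeps the log volume manageable on a job with many ranks. Drop NCCL_DEBUG_SUBSYS if the full output is needed.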

There is another GitHub issue describing a similar problem; you can check whether anything there helps: Model parallel with DDP get `Socket Timeout` error when using NCCL, while GLOO works fine · Issue #25767 · pytorch/pytorch · GitHub

Thanks for the information, and sorry for the late reply; it sometimes takes a long time to get a resource allocation on our platform.

This is the version information:

>>> import torch
>>> print(torch.cuda.nccl.version())
(2, 10, 3)
>>> print(torch.__version__)
1.10.1

These are the latest results with NCCL_DEBUG=INFO; here are the error messages:

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 3 (pid: 3508) of binary: /global/homes/z/zw241/.conda/envs/pt-1.10/bin/python
nid002333:34633:34692 [0] transport/net_ib.cc:74 NCCL WARN NET/IB : Got async event : GID table change

nid002333:34632:34693 [0] transport/net_ib.cc:74 NCCL WARN NET/IB : Got async event : GID table change

nid002333:34631:34694 [0] transport/net_ib.cc:74 NCCL WARN NET/IB : Got async event : GID table change
...
nid002308:40058:40129 [0] transport/net_ib.cc:74 NCCL WARN NET/IB : Got async event : GID table change
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 302.4874405860901 seconds
Traceback (most recent call last):
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/elastic/agent/server/api.py", line 899, in _exit_barrier
    store_util.barrier(
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/run.py", line 710, in run
    elastic_launch(
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/global/homes/z/zw241/.conda/envs/pt-1.10/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
main.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2022-02-07_15:19:57
  host      : nid002341
  rank      : 91 (local_rank: 3)
  exitcode  : 1 (pid: 3508)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
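In the trace above, the Socket Timeout is raised from the agent's _exit_barrier after a worker had already exited with code 1, and the summary shows error_file: <N/A>, so the worker's own traceback is the missing piece. The linked error-propagation page describes decorating the script's entry point with `record` to capture it. A minimal sketch follows, assuming the training script has a single top-level function; the name `main` and its body are placeholders, and the import fallback exists only so that the sketch runs where PyTorch is not installed.

```python
try:
    # Decorator from torchelastic that records uncaught exceptions of the
    # wrapped function so the launcher can report them as the root cause.
    from torch.distributed.elastic.multiprocessing.errors import record
except ImportError:
    # Fallback for this sketch only: a no-op decorator.
    def record(fn):
        return fn


@record
def main() -> int:
    """Placeholder for the real entry point of main.py."""
    # ... init_process_group, build the model, run the training loop ...
    return 0


if __name__ == "__main__":
    exit_code = main()
    print("training finished with exit code", exit_code)
```

With the decorator in place, the failure summary should be able to show the failing rank's Python traceback instead of only the exit code, which would help tell whether the worker itself crashed or was affected by the network events reported by NCCL.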

I am experiencing the same issue when trying to run on SLURM. Any idea what can be done to circumvent this, @wanchaol? Happy to provide details about the environment, commands, etc.