Hello,
We are trying to run distributed training on 32 nodes, with 4 GPUs per node. However, during a specific epoch the job fails with RuntimeError: Socket Timeout, as shown below:
Accuracy of the network on the 50000 test images: 0.6%
Max accuracy: 0.56%
Epoch: [3] [ 0/78] eta: 0:05:17 lr: 0.006401 loss: 6.8356 (6.8356) time: 4.0712 data: 1.9933 max mem: 12282
Epoch: [3] [10/78] eta: 0:00:59 lr: 0.006401 loss: 6.8580 (6.8624) time: 0.8734 data: 0.1814 max mem: 12282
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4286 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4287 closing signal SIGTERM
WARNING:torch.distributed.elastic.multiprocessing.api:Sending process 4289 closing signal SIGTERM
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 2 (pid: 4288) of binary: /global/homes/z/zw241/.conda/envs/pt-1.9/bin/python
ERROR:torch.distributed.elastic.agent.server.api:Error waiting on exit barrier. Elapsed: 313.7377507686615 seconds
Traceback (most recent call last):
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/agent/server/api.py", line 904, in _exit_barrier
    barrier_timeout=self._exit_barrier_timeout,
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 67, in barrier
    synchronize(store, data, rank, world_size, key_prefix, barrier_timeout)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 53, in synchronize
    agent_data = get_all(store, key_prefix, world_size)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/elastic/utils/store.py", line 31, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout
Traceback (most recent call last):
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launch.py", line 193, in <module>
    main()
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launch.py", line 189, in main
    launch(args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launch.py", line 174, in launch
    run(args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/global/homes/z/zw241/.conda/envs/pt-1.9/lib/python3.7/site-packages/torch/distributed/launcher/api.py", line 261, in launch_agent
    failures=result.failures,
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
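
For context, the Socket Timeout itself is raised from the elastic agent's exit barrier, after one local worker (local_rank 2, pid 4288) had already exited with code 1. For reference, below is a minimal sketch of how each worker process sets up its process group in our runs, assuming the env:// rendezvous variables that torch.distributed.launch / torch.distributed.run export in PyTorch 1.9; the function name and the explicit timeout value are illustrative placeholders, not our exact training code.

# Minimal sketch of the per-process setup (PyTorch 1.9, one process per GPU).
# Assumes the launcher exports LOCAL_RANK, RANK, WORLD_SIZE, MASTER_ADDR, MASTER_PORT.
import datetime
import os

import torch
import torch.distributed as dist

def init_distributed():
    # Each node runs 4 such processes (one per GPU), 128 processes in total.
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    dist.init_process_group(
        backend="nccl",
        init_method="env://",  # reads RANK / WORLD_SIZE / MASTER_ADDR / MASTER_PORT
        timeout=datetime.timedelta(minutes=30),  # the default; we have not changed it
    )
    return local_rank

if __name__ == "__main__":
    local_rank = init_distributed()
    # ... build the model, wrap it in DistributedDataParallel, and train ...
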
We are not sure whether this is an issue with the platform or with the PyTorch library. We ran the same training on 16 nodes and on 8 nodes, and both runs completed without problems. Thanks for your help!