I’ve just gotten my hands on two workstations with a pair of GPUs each, and I have been trying to run distributed training across them.
Training works on a single machine with both GPUs active, but I’ve been unsuccessful in getting the two machines to work together. I keep getting RuntimeError: Connection reset by peer and I’m not entirely sure what to do (full error at the end of this post).
I’m running in a Docker container based off nvcr.io/nvidia/pytorch:21.07-py3, with an example of the run command below; the host system is Ubuntu 20.04 server (Docker version 20.10.7, build f0df350). Both systems are just connected to the University network (the wall ports are adjacent to each other, but obviously I can’t tell whether they end up on the same switch or what firewall rules sit between them).
docker run -d -it --gpus all --shm-size 16G -p 29400:29400 \
--mount type=bind,source=/datasets,target=/tmp/training_data,readonly \
image:tag -m torch.distributed.run --nnodes=2 --nproc_per_node=2 \
--rdzv_id='1234' --rdzv_backend='c10d' --rdzv_endpoint='system_a_ip' \
train_detached.py --backend 'nccl' --other-args...
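As far as I understand it, with --rdzv_backend=c10d the node named in --rdzv_endpoint hosts a TCPStore on port 29400 (the default when no port is given) and the other node connects to it as a client, which is why I publish 29400 above. My mental model of that handshake is roughly the sketch below (a hypothetical helper script with a placeholder address, not my actual code):

import sys
from datetime import timedelta

import torch.distributed as dist

HOST = "system_a_ip"   # placeholder: the node named in --rdzv_endpoint
PORT = 29400           # default port for the c10d rendezvous

if sys.argv[1] == "server":
    # The launcher on the endpoint node hosts the store...
    store = dist.TCPStore(HOST, PORT, 2, True, timedelta(seconds=120))
    store.set("ping", "pong")
else:
    # ...and the launcher on the other node connects to it as a client.
    store = dist.TCPStore(HOST, PORT, 2, False, timedelta(seconds=120))
    print("got from store:", store.get("ping"))

Running something like this as "server" inside the container on system A and as "client" on system B should tell me whether the store itself is reachable; if the client side already dies with Connection reset by peer here, the problem is in the networking rather than anything in train_detached.py.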
The firewall is disabled on both systems:
sudo ufw status
Status: inactive
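For what it’s worth, the most basic connectivity check I can think of is just opening a plain TCP connection from one machine to the published rendezvous port on the other, which should rule out anything sitting between the hosts even with ufw inactive (minimal sketch, placeholder address; it assumes something, e.g. the launcher or the server half of the sketch above, is already listening on 29400 on system A):

import socket

HOST = "system_a_ip"   # placeholder for the workstation hosting the rendezvous
PORT = 29400           # the port published in the docker run above

try:
    with socket.create_connection((HOST, PORT), timeout=5):
        print(f"TCP connection to {HOST}:{PORT} succeeded")
except OSError as exc:
    print(f"TCP connection to {HOST}:{PORT} failed: {exc}")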
[ERROR] 2021-08-02 08:12:35,300 error_handler: {
"message": {
"message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
"extraInfo": {
"py_callstack": "Traceback (most recent call last):\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 104, in _call_store\n return getattr(self._store, store_op)(*args, **kwargs)\nRuntimeError: Connection reset by peer\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 351, in wrapper\n return f(*args, **kwargs)\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 214, in launch_agent\n rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 64, in get_rendezvous_handler\n return handler_registry.create_handler(params)\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/api.py\", line 253, in create_handler\n handler = creator(params)\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 35, in _create_c10d_handler\n backend, store = create_backend(params)\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 239, in create_backend\n return C10dRendezvousBackend(store, params.run_id), store\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 55, in __init__\n self._call_store(\"compare_set\", self._key, \"\", self._NULL_SENTINEL)\n File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 106, in _call_store\n raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
"timestamp": "1627891955"
}
}
}
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 104, in _call_store
return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
return _run_code(code, main_globals, None,
File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
exec(code, run_globals)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 638, in <module>
main()
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 630, in main
run(args)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 622, in run
elastic_launch(
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
return f(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 214, in launch_agent
rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
return handler_registry.create_handler(params)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
handler = creator(params)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
backend, store = create_backend(params)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 239, in create_backend
return C10dRendezvousBackend(store, params.run_id), store
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 55, in __init__
self._call_store("compare_set", self._key, "", self._NULL_SENTINEL)
File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 106, in _call_store
raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
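Since the failure happens during rendezvous, before any of my training code runs, I’d expect a bare-bones script like the one below (a hypothetical stand-in for train_detached.py, using gloo so it stays CPU-only) to reproduce the same error when launched with the same torch.distributed.run arguments on both nodes:

import os

import torch.distributed as dist


def main():
    # torch.distributed.run sets MASTER_ADDR, MASTER_PORT, RANK and
    # WORLD_SIZE, so env:// initialization needs no extra arguments.
    dist.init_process_group(backend="gloo", init_method="env://")
    dist.barrier()
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} "
          f"is up on {os.uname().nodename}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()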