I am trying to run my code on two servers, each with one GPU. It is a very simple script that I already tested on a single machine with 2 GPUs, where it works fine. I added code for the global rank and local rank so that it can run in multi-node mode, but I get the error shown below the commands.
I use these commands to run my code, the first one on the master node:
$ torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=10.30.7.22:29603 ddp-cifar100-multinode.py --epochs 10 --batch-size 16
$ torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=10.30.7.22:29603 ddp-cifar100-multinode.py --epochs 10 --batch-size 16
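The script itself is not included here. As a rough sketch of the rank handling I mean, assuming the script reads the standard environment variables that torchrun sets for each worker (RANK, LOCAL_RANK, WORLD_SIZE); the helper name read_dist_env is hypothetical and not taken from ddp-cifar100-multinode.py:

```python
import os


def read_dist_env(env=None):
    """Read the global rank, local rank and world size set by torchrun.

    `env` defaults to os.environ; passing a dict makes the helper easy to test.
    The fallbacks (0, 0, 1) correspond to a plain single-process run.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", 0))              # global rank across all nodes
    local_rank = int(env.get("LOCAL_RANK", 0))  # rank within this node, used to pick the GPU
    world_size = int(env.get("WORLD_SIZE", 1))  # total number of processes
    return rank, local_rank, world_size
```

With the two commands above, node 0 would be expected to see (0, 0, 2) and node 1 (1, 0, 2), since each node starts a single process.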
ERROR:torch.distributed.elastic.multiprocessing.errors.error_handler:{
  "message": {
    "message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 156, in _create_tcp_store\n host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)\nRuntimeError: connect() timed out. Original timeout was 60000 ms.\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 345, in wrapper\n return f(*args, **kwargs)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py\", line 719, in main\n run(args)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py\", line 713, in run\n )(*cmd_args)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py\", line 131, in __call__\n return launch_agent(self._config, self._entrypoint, list(args))\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py\", line 228, in launch_agent\n rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 64, in get_rendezvous_handler\n return handler_registry.create_handler(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/api.py\", line 253, in create_handler\n handler = creator(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 35, in _create_c10d_handler\n backend, store = create_backend(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 250, in create_backend\n store = _create_tcp_store(params)\n File \"/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 177, in _create_tcp_store\n ) from exc\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
      "timestamp": "1715699987"
    }
  }
}
Traceback (most recent call last):
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 156, in _create_tcp_store
    host, port, is_master=is_server, timeout=timedelta(seconds=read_timeout)
RuntimeError: connect() timed out. Original timeout was 60000 ms.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/omid/omid/omid_env/bin/torchrun", line 11, in <module>
    sys.exit(main())
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 345, in wrapper
    return f(*args, **kwargs)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py", line 719, in main
    run(args)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/run.py", line 713, in run
    )(*cmd_args)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 131, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/launcher/api.py", line 228, in launch_agent
    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
    handler = creator(params)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 250, in create_backend
    store = _create_tcp_store(params)
  File "/home/omid/omid/omid_env/lib/python3.6/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 177, in _create_tcp_store
    ) from exc
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
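Since the inner exception is a connect() timeout against the rendezvous endpoint, one thing that can be checked independently of torch is whether the second server can open a plain TCP connection to 10.30.7.22:29603 at all. Below is a minimal sketch using only the Python standard library; can_connect is a hypothetical helper name, and the check can only succeed while something is listening on that port (for example while the torchrun command on the master node is already running). It only tests reachability and says nothing about whether the rendezvous itself would then succeed.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts
        return False


# Example, to be run on the second node while the master node is waiting:
# print(can_connect("10.30.7.22", 29603))
```

If this returns False while the master node is waiting, the likely area to look at would be the network path between the two servers (firewall rules, the address the master is reachable on, or the chosen port) rather than the training script.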