Hello, I am trying to use Distributed Data Parallel to train a model across multiple nodes (each having at least one GPU). After several failed attempts to train my own model, I decided to test the multi-node training demo program from PyTorch's GitHub. I ran this command, as given in PyTorch's YouTube tutorial, on the host node:
torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py 50 10
Nothing was printed to the screen, but I assume that it was waiting for a connection. On the client node, I ran this:
torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py 50 10
After about a minute, I got this error from it:
DemoUser@DemoDesktop:/home/DemoUser/DDP Development$ torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING]
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING] *****************************************
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING] *****************************************
[E socket.cpp:957] [c10d] The client socket has timed out after 60s while trying to connect to (192.168.***.***, 7777).
Traceback (most recent call last):
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
    store = TCPStore(
            ^^^^^^^^^
torch.distributed.DistNetworkError: The client socket has timed out after 60s while trying to connect to (192.168.***.***, 7777).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/DemoUser/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66, in get_rendezvous_handler
    return handler_registry.create_handler(params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258, in create_handler
    handler = creator(params)
              ^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
    backend, store = create_backend(params)
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 249, in create_backend
    store = _create_tcp_store(params)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 175, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
I have spent several days on this without finding a solution. Port 7777 is open on both machines, and both have the latest version of PyTorch installed. Any help is greatly appreciated!
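In case it helps narrow things down, here is a minimal sketch of how plain TCP reachability of the rendezvous endpoint could be checked from the client node, independent of torchrun. The helper name `can_connect` and the placeholder address are my own assumptions, not part of the demo program. The check can only succeed while something is actually listening on the port, so the host-side torchrun command would need to be running at the time.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Placeholder endpoint (assumption): substitute the address and port
    # that were passed to --rdzv_endpoint on the host node.
    host, port = "127.0.0.1", 7777
    print(f"{host}:{port} reachable: {can_connect(host, port)}")
```

If this returns False from the client while the host command is running, the timeout is probably a network-level issue (firewall, wrong interface or address) rather than something in the training script.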