DDP not connecting on local machines with C10d

Hello, I am trying to use Distributed Data Parallel (DDP) to train a model across multiple nodes (each having at least one GPU). After several attempts to train my own model failed, I decided to test PyTorch’s GitHub demo program for multi-node training. I ran this command, as given in PyTorch’s YouTube tutorial, on the host node:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py 50 10

Nothing was printed to the screen, but I assume that it was waiting for a connection. On the client node, I ran this:

torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py 50 10

After about a minute, I got this error from it:

DemoUser@DemoDesktop:/home/DemoUser/DDP Development$ torchrun --nproc_per_node=2 --nnodes=2 --node_rank=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING] 
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING] *****************************************
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
[2024-02-20 14:59:01,197] torch.distributed.run: [WARNING] *****************************************
[E socket.cpp:957] [c10d] The client socket has timed out after 60s while trying to connect to (192.168.***.***, 7777).
Traceback (most recent call last):
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 155, in _create_tcp_store
    store = TCPStore(
            ^^^^^^^^^
torch.distributed.DistNetworkError: The client socket has timed out after 60s while trying to connect to (192.168.***.***, 7777).

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/DemoUser/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 238, in launch_agent
    rdzv_handler=rdzv_registry.get_rendezvous_handler(rdzv_parameters),
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 66, in get_rendezvous_handler
    return handler_registry.create_handler(params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/api.py", line 258, in create_handler
    handler = creator(params)
              ^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 36, in _create_c10d_handler
    backend, store = create_backend(params)
                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 249, in create_backend
    store = _create_tcp_store(params)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/DemoUser/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 175, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

I have been unable to find a solution to this problem for several days. Port 7777 is open on both machines, and both are running the latest version of PyTorch. Any help is greatly appreciated!
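(As a sanity check for anyone reproducing this: the port can be tested over plain TCP without involving torchrun at all. The IP below is a placeholder for the host’s LAN address, and some netcat builds want -p together with -l.)

# On the host node: listen on the rendezvous port over plain TCP.
nc -l 7777

# On the worker node: try to reach it. A successful connection means the port
# really is open to TCP; a refusal or a hang points at the firewall.
nc -vz 192.168.xxx.xxx 7777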

I used to have similar problems with RPC. If your nodes are not connected through eth0, DDP is probably trying that interface by default, and that could be the cause. My solution was to explicitly specify the network interface (eth0, enp-something…), as in Getting Gloo error when connecting server and client over VPN from different systems - #2 by selineni
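Something like this before launching torchrun on each node, for example (eth0 is only a placeholder; use whatever interface `ip addr` shows carrying your 192.168.x.x address):

# Pin the interface used by the Gloo backend (and by NCCL if you train on GPUs).
# Replace eth0 with the interface that actually holds the 192.168.x.x address.
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=456 \
    --rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py 50 10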

Each of my nodes has a single wired Ethernet connection. Thus, that isn’t likely the problem.

Hmmm, sadly I’m not enough of an expert to propose another possible solution… :confused:

I found that my client node has at least three versions of Python installed: 2.7, 3.10, and 3.12 (which I use). Could that be a problem? :thinking:
Also, thorough testing has revealed that when I launch from the host machine, it seems to behave correctly, while the client PC fails to connect. Further, when I reverse the roles so that the client acts as the host, it fails to connect… to itself, while the other node correctly attempts to connect. Thus, I theorize that my main host node works perfectly and the worker node is the one failing. :person_shrugging:
Lastly, modifying the run command on both machines affects the error message.
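One isolation test that would take torchrun out of the picture is a bare TCPStore handshake between the two machines; if the worker can’t reach the host’s store even like this, the problem is below PyTorch. A rough sketch (the IP is a placeholder for the host’s LAN address):

# On the host node (server side of the store); blocks until the worker joins:
python -c "from torch.distributed import TCPStore; from datetime import timedelta; s = TCPStore('192.168.xxx.xxx', 7777, 2, True, timedelta(seconds=60)); s.set('ping', 'ok'); print('host store up')"

# On the worker node (client side), pointing at the host's address:
python -c "from torch.distributed import TCPStore; from datetime import timedelta; s = TCPStore('192.168.xxx.xxx', 7777, 2, False, timedelta(seconds=60)); print(s.get('ping'))"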

:new: UPDATE :new:
I found that my training port (7777) was open, but not to TCP. With it opened correctly on both nodes, both machines print nothing and wait for about fifteen minutes before returning the following:

Host Node :desktop_computer:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=0 --rdzv_id=456 rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py 50 10
Traceback (most recent call last):
  File "/home/VictorUbuntu/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 806, in main
    run(args)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/run.py", line 797, in run
    elastic_launch(
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 134, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/launcher/api.py", line 255, in launch_agent
    result = agent.run()
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 736, in run
    result = self._invoke_run(role)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 871, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 705, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 549, in _rendezvous
    workers = self._assign_worker_ranks(store, group_rank, group_world_size, spec)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/metrics/api.py", line 124, in wrapper
    result = f(*args, **kwargs)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 637, in _assign_worker_ranks
    role_infos = self._share_and_gather(store, group_rank, group_world_size, spec)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/agent/server/api.py", line 674, in _share_and_gather
    role_infos_bytes = store_util.synchronize(
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 64, in synchronize
    agent_data = get_all(store, rank, key_prefix, world_size)
  File "/home/VictorUbuntu/.local/lib/python3.10/site-packages/torch/distributed/elastic/utils/store.py", line 34, in get_all
    data = store.get(f"{prefix}{idx}")
RuntimeError: Socket Timeout

Client Node :desktop_computer:

torchrun --nproc_per_node=1 --nnodes=2 --node_rank=1 --rdzv_id=456 rdzv_backend=c10d --rdzv_endpoint=192.168.***.***:7777 multinode.py 50 10
[E socket.cpp:957] [c10d] The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).
Traceback (most recent call last):
  File "/home/VictorPC/.local/bin/torchrun", line 8, in <module>
    sys.exit(main())
             ^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
    result = agent.run()
             ^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
    result = self._invoke_run(role)
             ^^^^^^^^^^^^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
    self._rendezvous(worker_group)
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
    result = f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 542, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
                                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/VictorPC/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: The client socket has timed out after 900s while trying to connect to (127.0.0.1, 29500).

So I’ve concluded that one problem was the port connection type. With that solved, what could be impeding the connection now? :thinking:

Well… I solved this particular issue. The DDP port was open, but not specifically to TCP connections.
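For anyone hitting the same thing: the fix amounts to allowing the rendezvous port for TCP specifically, on both nodes. On Ubuntu with ufw that looks roughly like this (7777 is just the port I chose; adjust for whatever firewall you run):

# Allow the rendezvous/DDP port for TCP on *both* nodes (ufw example).
sudo ufw allow 7777/tcp
sudo ufw status          # confirm 7777/tcp now shows as ALLOW

# Rough firewalld equivalent:
# sudo firewall-cmd --permanent --add-port=7777/tcp && sudo firewall-cmd --reload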