Hello, I have been trying for weeks to get DDP working to speed up my PyTorch training. Unfortunately, I have been met with a wall of failures. I am trying to run PyTorch’s DDP multi-node example, but with no success. I execute the following on each node:
TP_SOCKET_IFNAME=<Socket> NCCL_SOCKET_IFNAME=<Socket> GLOO_SOCKET_IFNAME=<Socket> time torchrun --nnodes=2 --nproc-per-node=1 --master-addr=<Master Address> --node_rank=<0 to 1> --master-port=7777 --start-method=spawn multinode.py 50 10
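For anyone reproducing this, a quick sanity check I can run on both nodes is a small debug script (my own snippet, not part of the example) that dumps the variables torchrun exports to each worker, to confirm the rendezvous settings actually match on both machines:

```python
import os

# torchrun exports these to every worker process; mismatched MASTER_ADDR /
# MASTER_PORT or a wrong RANK between the nodes would explain a failed
# rendezvous. Unset variables print as "<unset>".
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(var, "=", os.environ.get(var, "<unset>"))
```

Running this via `torchrun ... debug_env.py` on each node should show the same MASTER_ADDR/MASTER_PORT everywhere and a distinct RANK per node.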
First, I discovered that I had forgotten to open TCP port 7777 on both nodes. With that fixed, the worker node (rank 1) now spits out the traceback below while the master node waits:
Traceback (most recent call last):
File "/home/<User>/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
self._rendezvous(worker_group)
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 542, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: Connection reset by peer
Command exited with non-zero status 1
1.92user 0.49system 0:01.96elapsed 123%CPU (0avgtext+0avgdata 378640maxresident)k
0inputs+8outputs (0major+51822minor)pagefaults 0swaps
Using ‘nc’ to test the port on both nodes, I have confirmed that each machine can connect to the other through port 7777. What could possibly be wrong? Any help is greatly appreciated!
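In case it helps, the ‘nc’ check can be reproduced in plain Python with the standard socket module (the host and port below are placeholders; substitute the actual master address):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this from the worker node against the master node, e.g.:
# can_connect("<Master Address>", 7777)
print(can_connect("127.0.0.1", 7777))
```

This confirms raw TCP reachability only; the DistNetworkError ("Connection reset by peer") happens after the connection is established, which is why the nc test alone passes.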