Hello, I have been trying for weeks to get DDP working to speed up my PyTorch training. Unfortunately, I have been met with a wall of failures. I am trying to run PyTorch’s DDP multi-node example, but with no success. I execute the following on each node:
TP_SOCKET_IFNAME=<Socket> NCCL_SOCKET_IFNAME=<Socket> GLOO_SOCKET_IFNAME=<Socket> time torchrun --nnodes=2 --nproc-per-node=1 --master-addr=<Master Address> --node_rank=<0 to 1> --master-port=7777 --start-method=spawn multinode.py 50 10
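For anyone reproducing this, a quick sanity check I can run on both nodes is a small debug script (my own snippet, not part of the example) that dumps the variables torchrun exports to each worker, to confirm the rendezvous settings actually match on both machines:

```python
import os

# torchrun exports these to every worker process; mismatched MASTER_ADDR /
# MASTER_PORT or a wrong RANK between the nodes would explain a failed
# rendezvous. Unset variables print as "<unset>".
for var in ("MASTER_ADDR", "MASTER_PORT", "RANK", "WORLD_SIZE", "LOCAL_RANK"):
    print(var, "=", os.environ.get(var, "<unset>"))
```

Running this via `torchrun ... debug_env.py` on each node should show the same MASTER_ADDR/MASTER_PORT everywhere and a distinct RANK per node.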
First, I discovered that I had forgotten to open TCP port 7777 on both nodes. With that fixed, the worker node (rank 1) now spits out the traceback below while the master node waits:
Traceback (most recent call last):
File "/home/<User>/.local/bin/torchrun", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 812, in main
run(args)
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/run.py", line 803, in run
elastic_launch(
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 259, in launch_agent
result = agent.run()
^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 727, in run
result = self._invoke_run(role)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 862, in _invoke_run
self._initialize_workers(self._worker_group)
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 699, in _initialize_workers
self._rendezvous(worker_group)
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/metrics/api.py", line 123, in wrapper
result = f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/agent/server/api.py", line 542, in _rendezvous
store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/<User>/.local/lib/python3.12/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
self._store = TCPStore( # type: ignore[call-arg]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistNetworkError: Connection reset by peer
Command exited with non-zero status 1
1.92user 0.49system 0:01.96elapsed 123%CPU (0avgtext+0avgdata 378640maxresident)k
0inputs+8outputs (0major+51822minor)pagefaults 0swaps
Using ‘nc’ to test the port on both nodes, I have confirmed that each machine can connect to the other through port 7777. What could possibly be wrong? Any help is greatly appreciated!
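In case it helps, the ‘nc’ check can be reproduced in plain Python with the standard socket module (the host and port below are placeholders; substitute the actual master address):

```python
import socket

def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run this from the worker node against the master node, e.g.:
# can_connect("<Master Address>", 7777)
print(can_connect("127.0.0.1", 7777))
```

This confirms raw TCP reachability only; the DistNetworkError ("Connection reset by peer") happens after the connection is established, which is why the nc test alone passes.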