init_process_group times out when using two nodes

Hardware/Software information:

  • PyTorch version is 2.2.1
  • The nodes are connected via 10-gigabit Ethernet (no InfiniBand)
  • I’ve verified that the nodes can ping each other, and I’ve also used netcat to send strings between them over TCP (a Python equivalent of that check is sketched right after this list)
  • I’m using the NCCL backend in init_process_group
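
For reference, a minimal Python equivalent of that netcat connectivity check looks like this (the host and port below are placeholders, not values from my setup):

import socket

# Placeholders: on the child node these would be the master node's IP
# and whichever port netcat was listening on.
HOST = "192.0.2.10"
PORT = 12345

# Open a TCP connection and send a short test string, mirroring the
# netcat test from the list above.
with socket.create_connection((HOST, PORT), timeout=10) as sock:
    sock.sendall(b"hello from the child node\n")
    print(f"TCP connection to {HOST}:{PORT} succeeded")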

Test script:

import torch.distributed as dist
import os 
import datetime

if __name__ == "__main__":
    rank = int(os.environ["RANK"])
    device = f'cuda:{rank}'
    print(f'My rank is: {rank}')
    timeout = datetime.timedelta(seconds=90)
    dist.init_process_group(backend="nccl", timeout=timeout)
    dist.barrier()
    print(f'node with rank {rank} successfully init process group!')
    if rank == 0:
        print(f'World size is: {os.environ["WORLD_SIZE"]}')
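
For debugging, the store-related environment variables that torchrun hands each worker can be printed with a small separate snippet (kept out of test.py so the traceback line numbers below still match the script above):

import os

# torchrun sets these for every worker; MASTER_ADDR and MASTER_PORT are
# what init_process_group's default env:// rendezvous points the TCPStore at.
for var in ("RANK", "LOCAL_RANK", "WORLD_SIZE", "MASTER_ADDR", "MASTER_PORT"):
    print(f"{var} = {os.environ.get(var, '<unset>')}")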

Torchrun Commands:

Master node: LOGLEVEL=INFO torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=localhost test.py

Child node: LOGLEVEL=INFO torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=<IP of master node> test.py
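
For reference, --rdzv_endpoint also accepts a host:port pair (as far as I know, 29400 is the default port when it’s omitted), so written out fully the same launches would be:

Master node: LOGLEVEL=INFO torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=localhost:29400 test.py

Child node: LOGLEVEL=INFO torchrun --nnodes=2 --nproc_per_node=1 --rdzv_id=456 --rdzv_backend=c10d --rdzv_endpoint=<IP of master node>:29400 test.py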

Output logs on master:

[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO] Starting elastic_operator with launch configs:
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   entrypoint: test.py
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   min_nodes: 2
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   max_nodes: 2
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   nproc_per_node   : 1
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   run_id: 456
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   rdzv_backend     : c10d
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   rdzv_endpoint    : localhost
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   rdzv_configs     : {'timeout': 900}
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   max_restarts     : 0
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   monitor_interval : 5
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   log_dir: None
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]   metrics_cfg      : {}
[2024-03-26 19:03:16,881] torch.distributed.launcher.api: [INFO]
[2024-03-26 19:03:16,888] torch.distributed.elastic.agent.server.local_elastic_agent: [INFO] log directory set to: /tmp/torchelastic_j320gcba/456_mioz5c3q
[2024-03-26 19:03:16,889] torch.distributed.elastic.agent.server.api: [INFO] [default] starting workers for entrypoint: python
[2024-03-26 19:03:16,889] torch.distributed.elastic.agent.server.api: [INFO] [default] Rendezvous'ing worker group
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO] [default] Rendezvous complete for workers. Result:
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   restart_count=0
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   master_addr=localhost
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   master_port=45363
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   group_rank=0
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   group_world_size=2
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   local_ranks=[0]
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   role_ranks=[0]
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   global_ranks=[0]
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   role_world_sizes=[2]
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]   global_world_sizes=[2]
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO]
[2024-03-26 19:03:33,082] torch.distributed.elastic.agent.server.api: [INFO] [default] Starting worker group
[2024-03-26 19:03:33,083] torch.distributed.elastic.agent.server.local_elastic_agent: [INFO] Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
[2024-03-26 19:03:33,084] torch.distributed.elastic.multiprocessing: [INFO] Setting worker0 reply file to: /tmp/torchelastic_j320gcba/456_mioz5c3q/attempt_0/0/error.json
My rank is: 0
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    dist.init_process_group(backend="nccl", timeout=timeout)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistStoreError: Timed out after 91 seconds waiting for clients. 1/2 clients joined.
[2024-03-26 19:05:08,193] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 11111) of binary: /h/aditmeh/.conda/envs/torch/bin/python
[2024-03-26 19:05:08,207] torch.distributed.elastic.multiprocessing.errors: [INFO] ('local_rank %s FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html', 0)
Traceback (most recent call last):
  File "/h/aditmeh/.conda/envs/torch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------

Output logs on child:

My rank is: 1
[E socket.cpp:957] [c10d] The client socket has timed out after 90s while trying to connect to (localhost, 45363).
Traceback (most recent call last):
  File "test.py", line 10, in <module>
    dist.init_process_group(backend="nccl", timeout=timeout)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/c10d_logger.py", line 86, in wrapper
    func_return = func(*args, **kwargs)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 1177, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 246, in _env_rendezvous_handler
    store = _create_c10d_store(master_addr, master_port, rank, world_size, timeout, use_libuv)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 174, in _create_c10d_store
    return TCPStore(
torch.distributed.DistNetworkError: The client socket has timed out after 90s while trying to connect to (localhost, 45363).
[2024-03-26 19:05:02,185] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 0 (pid: 5522) of binary: /h/aditmeh/.conda/envs/torch/bin/python
[2024-03-26 19:05:07,122] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'slurm3_5519_0' has failed to send a keep-alive heartbeat to the rendezvous '456' due to an error of type RendezvousConnectionError.
[2024-03-26 19:05:12,136] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'slurm3_5519_0' has failed to send a keep-alive heartbeat to the rendezvous '456' due to an error of type RendezvousConnectionError.
[2024-03-26 19:05:17,150] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'slurm3_5519_0' has failed to send a keep-alive heartbeat to the rendezvous '456' due to an error of type RendezvousConnectionError.
[2024-03-26 19:05:22,165] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'slurm3_5519_0' has failed to send a keep-alive heartbeat to the rendezvous '456' due to an error of type RendezvousConnectionError.
[2024-03-26 19:05:26,628] torch.distributed.elastic.rendezvous.dynamic_rendezvous: [WARNING] The node 'slurm3_5519_0' has failed to shutdown the rendezvous '456' due to an error of type RendezvousConnectionError.
Traceback (most recent call last):
  File "/h/aditmeh/.conda/envs/torch/bin/torchrun", line 33, in <module>
    sys.exit(load_entry_point('torch==2.2.1', 'console_scripts', 'torchrun')())
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 347, in wrapper
    return f(*args, **kwargs)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 812, in main
    run(args)
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/run.py", line 803, in run
    elastic_launch(
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 135, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/h/aditmeh/.conda/envs/torch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 268, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
test.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------

I’m not sure exactly what’s causing the timeout. From the child’s log, it looks like the child node is trying to connect to localhost when it should be connecting to the master node’s IP, which is strange because I’m passing the master node’s IP in the child’s torchrun command. So I’m at a loss as to what’s going on.
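
To narrow things down, here is a minimal standalone sketch for exercising just the TCPStore connection, independent of torchrun and NCCL (the filename store_check.py, the port 29500, and the IP placeholder are hypothetical, not taken from the failing run):

import sys
from datetime import timedelta
from torch.distributed import TCPStore

# Hypothetical usage:
#   on the master node: python store_check.py server
#   on the child node:  python store_check.py client
is_server = sys.argv[1] == "server"

# Arguments: host, port, world_size, is_master, timeout.
# The server blocks in the constructor until world_size workers have
# joined (its own internal client counts as one), which mirrors the
# "1/2 clients joined" error above.
store = TCPStore("<IP of master node>", 29500, 2, is_server,
                 timedelta(seconds=90))

if is_server:
    store.set("ping", "pong")
    print("server: both workers joined")
else:
    print("client got:", store.get("ping"))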

Any help would be appreciated!