TCPStore( RuntimeError: connect() timed out

Hello guys,
when I try to use the IP (10.100.4.219) or the hostname (training_machine0) to initialize DDP, the connection times out. But if I change the address to 127.0.0.1 (localhost), the init works. However, I can ping both 10.100.4.219 and training_machine0 successfully.
Python: 3.8.12
GPU: RTX 3090 * 8 per node, 2 nodes
torch: 1.9.0+cu111
Running in Docker with --network host
Running command:

python -m torch.distributed.run --nnodes=2 --rdzv_id=6666 --rdzv_backend=c10d --nproc_per_node=8 --rdzv_endpoint=training_machine0:29400 -m develop.A_my_training_group.cluster_train

Is there anything wrong with my settings, and how can I fix it?

More details: if I set the rdzv_endpoint to 127.0.0.1 or localhost, everything works, but if I set the rdzv_endpoint to training_machine0 or 10.100.4.219 (the local node's IP), the connection times out.
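For reference, the call that is timing out can be reproduced outside the launcher with a short standalone script (a sketch, not part of my training code; it assumes it runs on training_machine0 with port 29400 free, and swapping the IP for 127.0.0.1 reproduces the working case):

# standalone sketch of the failing TCPStore call
from datetime import timedelta
from torch.distributed import TCPStore

# is_master=True starts the store server and then connects to it at the given address
store = TCPStore("10.100.4.219", 29400, 1, True, timedelta(seconds=30))
store.set("ping", "pong")
print(store.get("ping"))  # prints b'pong' if the connection succeeds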

ping training_machine0:

logs:

[INFO] 2022-01-10 14:53:21,159 run: Running torch.distributed.run with args: ['/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py', '--nnodes=2', '--rdzv_id=6666', '--rdzv_backend=c10d', '--nproc_per_node=8', '--rdzv_endpoint=training_machine0:29400', '-m', 'develop.A_my_training_group.cluster_train']
[INFO] 2022-01-10 14:53:21,161 run: Using nproc_per_node=8.
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
[INFO] 2022-01-10 14:53:21,161 api: Starting elastic_operator with launch configs:
  entrypoint       : develop.A_my_training_group.cluster_train
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 8
  run_id           : 6666
  rdzv_backend     : c10d
  rdzv_endpoint    : training_machine0:29400
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

[ERROR] 2022-01-10 14:54:21,182 error_handler: {
  "message": {
    "message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 145, in _create_tcp_store\n    store = TCPStore(  # type: ignore[call-arg]\nRuntimeError: connect() timed out.\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 348, in wrapper\n    return f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 214, in launch_agent\n    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 64, in get_rendezvous_handler\n    return handler_registry.create_handler(params)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/api.py\", line 253, in create_handler\n    handler = creator(params)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 35, in _create_c10d_handler\n    backend, store = create_backend(params)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 204, in create_backend\n    store = _create_tcp_store(params)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 163, in _create_tcp_store\n    raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
      "timestamp": "1641797661"
    }
  }
}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 145, in _create_tcp_store
    store = TCPStore(  # type: ignore[call-arg]
RuntimeError: connect() timed out.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 637, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 629, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 621, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 348, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 214, in launch_agent
    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
    handler = creator(params)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 204, in create_backend
    store = _create_tcp_store(params)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 163, in _create_tcp_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.

Thank you!

Since rdzv_endpoint is training_machine0:29400, could you check that port 29400 is open between the two machines? Even if ping works, it is possible that a firewall is blocking that port, causing the TCP connection to fail.
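A quick way to test that beyond ping (a sketch, not from the reply above; it assumes the launcher and its C10d store are already up and listening on training_machine0) is to open a plain TCP connection to the endpoint from the second node:

# hypothetical reachability check, run on the second node while
# torch.distributed.run is running on training_machine0
import socket

with socket.create_connection(("10.100.4.219", 29400), timeout=10) as s:
    print("port 29400 on training_machine0 is reachable")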

cc @Kiuk_Chung @aivanou

Hi @pritamdamania87,
I don't think port 29400 is being blocked, because the connection still times out even when I use only a single machine with its hostname or IP (10.100.4.219 rather than 127.0.0.1).
But when I downgraded PyTorch from 1.9.0 to 1.7.0 with almost the same settings and used the old torch.distributed.launch command, the two nodes could finally do DDP training (though it was 2 times slower than using only one node).
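For reference, the pre-1.9 torch.distributed.launch equivalent of the torch.distributed.run command above looks roughly like this (a sketch, not the exact command used; --node_rank is 0 on training_machine0 and 1 on the other node):

# sketch only; launch.py also passes --local_rank to the script unless --use_env is given
python -m torch.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=8 \
    --master_addr=10.100.4.219 --master_port=29400 \
    --module develop.A_my_training_group.cluster_train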

Hello! Can you please give more info about your environment: the Dockerfile, which ports are open between the hosts, and whether there are any firewalls?

I tried to repro your use case with the following environment:
two EC2 servers in the same VPC and subnet, with all ports open between them.
The Dockerfile is:

FROM pytorch/pytorch:1.9.0-cuda10.2-cudnn7-runtime

WORKDIR /app

COPY . /app

The commands that I am running:

docker run --network host -it aivanou-test1 /bin/bash

python -m torch.distributed.run --rdzv_id 555 --rdzv_backend c10d --rdzv_endpoint 172.31.25.111:29400 --nnodes 2 simple.py

172.31.25.111 is the IP address of host0.

The above works for PyTorch 1.9 and 1.10.

Hello!
The base image is:

FROM determinedai/environments:cuda-11.1-pytorch-1.9-lightning-1.3-tf-2.4-gpu-0.17.2

NVIDIA driver: 495.46
GPU: RTX 3090 * 8 per node, 2 nodes
My docker run command:

docker run -itd --gpus all --name ly_env --ulimit core=0 --shm-size="16g" --network=host torch_env:dev

The two hosts can reach each other: I can ssh from each one to the other without a password, since I added each host's key to the other's authorized hosts.

I don't think there is any firewall, because all ports can be accessed.

Thank you!

Can you please try to repro the issue directly on the host?

For example, you can use the following commands:

pip install torch==1.9.0

#simple.py
print("hi")

# run on host:
python -m torch.distributed.run --rdzv_id 555 --rdzv_backend c10d --rdzv_endpoint IP_OF_MACHINE_0:29400 --nnodes 2 --nproc_per_node 2 simple.py
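If the bare print script launches on both nodes, a slightly fuller simple.py (a sketch, not part of the suggestion above) can confirm that the rendezvous and the process group come up end to end; torch.distributed.run exports MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE and LOCAL_RANK, so the default env:// init works:

# hypothetical fuller simple.py
import torch.distributed as dist

dist.init_process_group(backend="gloo")  # reads the env vars set by the launcher
print(f"rank {dist.get_rank()} of {dist.get_world_size()} is up")
dist.barrier()
dist.destroy_process_group()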

If you were able to repro the issue, can you do the following:

Run python -m torch.distributed.run --rdzv_id 555 --rdzv_backend c10d --rdzv_endpoint IP_OF_MACHINE_0:29400 --nnodes 2 --nproc_per_node 2 simple.py on training_machine0, then on the second host run the following command:

traceroute -T -p 29400 10.100.4.219

to check whether the other host can reach training_machine0 on that port.

Also, if you want faster iteration on this issue, you can DM me on Slack: csivanou at gmail dot com
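One more hypothetical check, not raised above: verify what training_machine0 resolves to inside the container, since a hosts entry that points the hostname at a loopback address would explain the hostname part of the "localhost works" symptom:

# hypothetical resolution check, run inside the container on training_machine0
import socket

print(socket.gethostbyname("training_machine0"))   # expected: 10.100.4.219
print(socket.gethostbyname(socket.gethostname()))  # should not be 127.0.0.1 / 127.0.1.1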

I am running PyTorch on 384 nodes (1536 ranks) and getting the same error.
It runs fine on 256 nodes (1024 ranks).

I tried with two PyTorch versions, 1.11.0 and 1.10.0.
I installed them using Anaconda:

conda install -c conda-forge pytorch-gpu

Hey @rakhianand, you should be getting a much more descriptive error message with v1.11 instead of a “connect() timed out”. Do you mind sharing your error output?

Hi, if I launch a job without specifying --rdzv_backend, which backend is used by default?
I am running training with this launcher on multiple nodes:

export LAUNCHER="torchrun \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --rdzv_backend c10d \
    --max_restarts 0 \
    --tee 3 \
    "

But it fails with error:

torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
[E socket.cpp:860] [c10d] The client socket has timed out after 60s while trying to connect to (MASTER ADDR, Port)

If I remove --rdzv_backend c10d, the training runs successfully (note also that the nodes don't have internet access). Is there a reason this flag causes the failure, and will removing it impact my training in any way?
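For reference, the variant without the explicit backend, which is the one reported to run successfully here (all other flags unchanged), is:

export LAUNCHER="torchrun \
    --nproc_per_node $GPUS_PER_NODE \
    --nnodes $NNODES \
    --rdzv_endpoint $MASTER_ADDR:$MASTER_PORT \
    --max_restarts 0 \
    --tee 3 \
    "

As far as I know, when --rdzv_backend is not given, torchrun defaults to its static rendezvous backend and treats the endpoint as the master address and port.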
