TorchElastic: Connection reset by peer

I’ve just got my hands on two workstations with a pair of GPUs each, and I’ve been trying to run distributed training across both of them. Training works on a single machine with both GPUs active, but I’ve been unsuccessful in getting the two machines to work together. I keep getting RuntimeError: Connection reset by peer and I’m not entirely sure what to do (full error below).

I’m running in a Docker container based on nvcr.io/nvidia/pytorch:21.07-py3, with an example of the run command below; the host system is Ubuntu 20.04 server (Docker version 20.10.7, build f0df350). Both systems are simply connected to the University network (the wall ports are adjacent to each other, but obviously, who knows whether they’re on the same switch or what firewall rules sit in between).

docker run -d -it --gpus all --shm-size 16G -p 29400:29400 \
    --mount type=bind,source=/datasets,target=/tmp/training_data,readonly \
    image:tag -m torch.distributed.run --nnodes=2 --nproc_per_node=2 \
    --rdzv_id='1234' --rdzv_backend='c10d' --rdzv_endpoint='system_a_ip' \
    train_detached.py --backend 'nccl' --other-args...

The firewall is disabled on both systems:

sudo ufw status
Status: inactive

and this is the full error I get:
[ERROR] 2021-08-02 08:12:35,300 error_handler: {
  "message": {
    "message": "RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.",
    "extraInfo": {
      "py_callstack": "Traceback (most recent call last):\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 104, in _call_store\n    return getattr(self._store, store_op)(*args, **kwargs)\nRuntimeError: Connection reset by peer\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py\", line 351, in wrapper\n    return f(*args, **kwargs)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py\", line 214, in launch_agent\n    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 64, in get_rendezvous_handler\n    return handler_registry.create_handler(params)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/api.py\", line 253, in create_handler\n    handler = creator(params)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py\", line 35, in _create_c10d_handler\n    backend, store = create_backend(params)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 239, in create_backend\n    return C10dRendezvousBackend(store, params.run_id), store\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 55, in __init__\n    self._call_store(\"compare_set\", self._key, \"\", self._NULL_SENTINEL)\n  File \"/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py\", line 106, in _call_store\n    raise RendezvousConnectionError(\ntorch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.\n",
      "timestamp": "1627891955"
    }
  }
}
Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 104, in _call_store
    return getattr(self._store, store_op)(*args, **kwargs)
RuntimeError: Connection reset by peer

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/opt/conda/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 638, in <module>
    main()
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 630, in main
    run(args)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/run.py", line 622, in run
    elastic_launch(
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 116, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 351, in wrapper
    return f(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 214, in launch_agent
    rdzv_handler = rdzv_registry.get_rendezvous_handler(rdzv_parameters)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 64, in get_rendezvous_handler
    return handler_registry.create_handler(params)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/api.py", line 253, in create_handler
    handler = creator(params)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/registry.py", line 35, in _create_c10d_handler
    backend, store = create_backend(params)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 239, in create_backend
    return C10dRendezvousBackend(store, params.run_id), store
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 55, in __init__
    self._call_store("compare_set", self._key, "", self._NULL_SENTINEL)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/c10d_rendezvous_backend.py", line 106, in _call_store
    raise RendezvousConnectionError(
torch.distributed.elastic.rendezvous.api.RendezvousConnectionError: The connection to the C10d store has failed. See inner exception for details.
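
For what it’s worth, here is a minimal sketch (not part of my training code) of a connectivity check that could be run from the second workstation; SYSTEM_A_IP is a placeholder for the endpoint node’s address and 29400 is the default c10d rendezvous port from the run command above:

# Hypothetical sanity check: can the second machine open a TCP connection
# to the c10d rendezvous port published by the endpoint node?
import socket

SYSTEM_A_IP = "system_a_ip"  # placeholder, same value as --rdzv_endpoint
RDZV_PORT = 29400            # default c10d rendezvous port, mapped with -p 29400:29400

try:
    with socket.create_connection((SYSTEM_A_IP, RDZV_PORT), timeout=5):
        print("TCP connection to the rendezvous port succeeded")
except OSError as exc:
    print(f"cannot reach {SYSTEM_A_IP}:{RDZV_PORT}: {exc}")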

So, unless I’m blind, I don’t see anything in the docs saying that you shouldn’t use --rdzv_endpoint=$ENDPOINT on the actual endpoint node. Now that that’s out of the way, I’m a step closer to things working. However, now either the host or the slave complains with ValueError: host not found: Temporary failure in name resolution. I have tried launching in two different ways, either using --network host or exposing the default port 29400; either way, one of the two nodes crashes with the error below.

Traceback (most recent call last):
  File "train_detached.py", line 19, in <module>
    run_training()
  File "train_detached.py", line 11, in run_training
    config_dict = initialise_training_modules()
  File "/workspace/nnet_training/training_init.py", line 208, in initialise_training_modules
    dist.init_process_group(backend=args.backend, init_method=args.dist_method)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 559, in init_process_group
    store, rank, world_size = next(rendezvous_iterator)
  File "/opt/conda/lib/python3.8/site-packages/torch/distributed/rendezvous.py", line 201, in _env_rendezvous_handler
    store = TCPStore(  # type: ignore[call-arg]
ValueError: host not found: Temporary failure in name resolution
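
As far as I can tell, this is raised while the worker resolves MASTER_ADDR, which the elastic agent sets to the rank-0 node’s hostname before spawning the workers. A rough sketch of just that resolution step (only an illustration, not the actual store setup) would be:

# Illustration of where "host not found" comes from: the env:// rendezvous
# reads MASTER_ADDR/MASTER_PORT (exported by the elastic agent) and resolves
# MASTER_ADDR before connecting. If that value is the other node's hostname
# or a container ID that this machine can't resolve, it fails as above.
import os
import socket

master_addr = os.environ.get("MASTER_ADDR", "<unset>")
master_port = os.environ.get("MASTER_PORT", "<unset>")
print(f"MASTER_ADDR={master_addr} MASTER_PORT={master_port}")

try:
    print("resolves to", socket.gethostbyname(master_addr))
except OSError as exc:
    print(f"cannot resolve {master_addr!r}: {exc}")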

When I use docker run ... --network host ..., the first rendezvous is shown below and both the master and slave agree on it. However, the computer name in master_addr is that of the slave, which is at 130.194.133.55 (not what I expected). You’ll note in the launch configs further down that the slave has the intended endpoint set while the master has no endpoint.

[INFO] 2021-08-02 12:23:42,863 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=mu00186362
  master_port=43937
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1]
  role_ranks=[2, 3]
  global_ranks=[2, 3]
  role_world_sizes=[4, 4]
  global_world_sizes=[4, 4]
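
I think whichever node ends up group rank 0 publishes its own machine name as master_addr, and with --network host that is the host’s hostname (mu00186362 here). A quick, hedged check of whether the other workstation can actually resolve and reach that address (hostname and port copied from the log above, so they would change every run) might look like:

# Hypothetical check from the node that crashes: can it resolve the published
# master_addr and connect to the ephemeral worker-store port? Both values are
# copied from the rendezvous log above.
import socket

master_addr = "mu00186362"  # machine name published by the rank-0 node
master_port = 43937         # ephemeral port from the same rendezvous log

try:
    with socket.create_connection((master_addr, master_port), timeout=5):
        print("worker store is reachable")
except OSError as exc:
    print(f"failed: {exc}")  # name resolution failure or an unmapped/blocked port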

If I instead expose only the default port (docker run ... -p 29400:29400 ...), the first rendezvous is agreed upon as

[INFO] 2021-08-02 12:38:32,335 api: [default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=4ebda5ed7386
  master_port=52585
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1]
  role_ranks=[2, 3]
  global_ranks=[2, 3]
  role_world_sizes=[4, 4]
  global_world_sizes=[4, 4]

Where the slave has the launch config

[INFO] 2021-08-02 12:23:40,064 api: Starting elastic_operator with launch configs:
  entrypoint       : train_detached.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 2
  run_id           : 1234
  rdzv_backend     : c10d
  rdzv_endpoint    : 130.194.129.201
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

and the master has the launch config

[INFO] 2021-08-02 12:23:41,712 api: Starting elastic_operator with launch configs:
  entrypoint       : train_detached.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 2
  run_id           : 1234
  rdzv_backend     : c10d
  rdzv_endpoint    : 
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

I ran into exactly the same error: connection reset by peer. May I ask how you finally solved it?

I’ve made the machines into a Kubernetes cluster and use the elastic controller for distributed training. It’s deprecated but still works fine on Kubernetes 1.21.