Pytorch DDP across nodes: self._store = TCPStore( # type: ignore[call-arg] RuntimeError: Stop_waiting response is expected

njukenanli · November 16, 2023, 8:23am

Hello, everyone. Our model has been training successfully with DDP on 4 GPUs on one node. I must thank friends on the forum for their help.

Now I am trying to train across multiple nodes (ports), and I have no experience with this. We rent 4 computing nodes from an outside company; they gave me one server IP address, 4 port numbers (one per node, each node with 4 GPUs), and a password. I used authorized_keys to set up password-free login across the nodes.

After reading some instructions online, I used the following code and got the error below. I am not sure whether the problem comes from the need to enter the login password, although I have set up authorized_keys. Or is it caused by something else? And do I need to learn other platforms or software such as Slurm?

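    import argparse
    import torch
    # Assumed import; the original post does not show where set_start_method comes from.
    from multiprocessing import set_start_method

    # The fragment below runs inside a method of the trainer class (hence `self`).
    set_start_method("forkserver")
    parser = argparse.ArgumentParser()
    # torch.distributed.launch passes --local_rank to every worker process.
    parser.add_argument('--local_rank', default=-1, type=int, help='local rank of this process on its node')
    args = parser.parse_args()
    # Bind this process to its own GPU before creating the process group.
    torch.cuda.set_device(args.local_rank)
    torch.cuda.empty_cache()
    # env:// reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE set by the launcher.
    torch.distributed.init_process_group(backend="nccl", init_method='env://')
    device = torch.device('cuda', args.local_rank)
    self.BartNN = self.BartNN.to(device)
    self.BartNN = torch.nn.parallel.DistributedDataParallel(
        self.BartNN, device_ids=[args.local_rank], output_device=args.local_rank)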
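As a side note on the --local_rank argument: launchers based on torch.distributed.run also export a LOCAL_RANK environment variable to each worker. A minimal sketch of reading the local rank from either source is shown here; `resolve_local_rank` is a hypothetical helper name, not a PyTorch function, and the fallback value of 0 is an assumption for single-process runs.

```python
import argparse
import os


def resolve_local_rank(argv=None) -> int:
    """Return the local rank from --local_rank if given, else from LOCAL_RANK."""
    parser = argparse.ArgumentParser()
    # Accept both spellings, since launcher versions differ in which one they pass.
    parser.add_argument("--local_rank", "--local-rank", dest="local_rank",
                        type=int, default=-1)
    # parse_known_args ignores any other script arguments.
    args, _ = parser.parse_known_args(argv)
    if args.local_rank >= 0:
        return args.local_rank
    # Fall back to the environment variable; 0 is an assumed default.
    return int(os.environ.get("LOCAL_RANK", 0))
```

The returned value could then be passed to torch.cuda.set_device in place of args.local_rank.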

I run the following commands, one on each port (node):

NCCL_IB_DISABLE=1  python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=0 --master_addr="***.**.**.**" --master_port=37692 main.py
NCCL_IB_DISABLE=1  python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=1 --master_addr="***.**.**.**" --master_port=37692 main.py 
NCCL_IB_DISABLE=1  python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=2 --master_addr="***.**.**.**" --master_port=37692 main.py
NCCL_IB_DISABLE=1  python -m torch.distributed.launch --nproc_per_node=4 --nnodes=4 --node_rank=3 --master_addr="***.**.**.**" --master_port=37692 main.py

(I disable InfiniBand here because of a previous problem, described in the thread "Pytorch DDP NCCL Error : Call to ibv_reg_mr failed with error Cannot allocate memory" in the distributed category of the PyTorch Forums.)
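For reference, the flags in these commands imply the following rank layout, assuming every node starts the same number of processes and that global ranks are assigned node by node. The function names are hypothetical helpers for illustration, not part of PyTorch.

```python
def world_size(nnodes: int, nproc_per_node: int) -> int:
    """Total number of worker processes across all nodes."""
    return nnodes * nproc_per_node


def global_rank(node_rank: int, nproc_per_node: int, local_rank: int) -> int:
    """Global rank of a worker, assuming ranks are assigned node by node."""
    return node_rank * nproc_per_node + local_rank


if __name__ == "__main__":
    # Values taken from the commands: --nnodes=4 --nproc_per_node=4
    print("world size:", world_size(4, 4))
    for node in range(4):
        ranks = [global_rank(node, 4, lr) for lr in range(4)]
        print(f"node_rank={node} -> global ranks {ranks}")
```

Under these assumptions, all 16 processes have to reach the same --master_addr and --master_port before training can start.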

Below is the error information.
For node0:

  warnings.warn(
WARNING:torch.distributed.run:
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
Traceback (most recent call last):
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
    result = self._invoke_run(role)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 844, in _invoke_run
    self._initialize_workers(self._worker_group)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 678, in _initialize_workers
    self._rendezvous(worker_group)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 538, in _rendezvous
    store, group_rank, group_world_size = spec.rdzv_handler.next_rendezvous()
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/rendezvous/static_tcp_rendezvous.py", line 55, in next_rendezvous
    self._store = TCPStore(  # type: ignore[call-arg]
RuntimeError: Stop_waiting response is expected

For node1:

Traceback (most recent call last):
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 195, in <module>
    main()
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 191, in main
    launch(args)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launch.py", line 176, in launch
    run(args)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/launcher/api.py", line 237, in launch_agent
    result = agent.run()
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/metrics/api.py", line 129, in wrapper
    result = f(*args, **kwargs)
  File "/data/data/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/distributed/elastic/agent/server/api.py", line 709, in run
 Aborted (core dumped)
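
One thing that can be checked independently of PyTorch is whether the address and port given to --master_addr and --master_port are reachable from the other nodes. The following is a minimal sketch using only the Python standard library; `can_connect` is a hypothetical helper name. A True result only shows that something accepts TCP connections on that port, not that the listener is the TCPStore started by node 0.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False
```

It could be called on nodes 1 to 3 while the launcher on node 0 is running, using the same host and port values that were passed on the command line.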