DDP Multinode tutorial example failing

I am using the code from the multinode DDP tutorial (examples/distributed/ddp-tutorial-series/multinode.py in the pytorch/examples repository on GitHub) with the following SLURM submission script:

#!/bin/bash

#SBATCH -N 2
#SBATCH --gres=gpu:volta:1
#SBATCH -c 10

source /etc/profile.d/modules.sh

module load anaconda/2023a
module load cuda/11.6
module load nccl/2.11.4-cuda11.6

nodes=( $( scontrol show hostnames $SLURM_JOB_NODELIST ) )
nodes_array=($nodes)
head_node=${nodes_array[0]}
head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address)

echo Node IP: $head_node_ip
export LOGLEVEL=INFO
export NCCL_DEBUG=INFO

srun torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_id $RANDOM \
    --rdzv_backend c10d \
    --rdzv_endpoint $head_node_ip:29503 \
    multi_tutorial.py 50 10

However, the job fails with the following output:

Node IP: 172.31.130.84
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /home/gridsan/rmehta/potential_function/multi_tutorial.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 1
  run_id           : 7644
  rdzv_backend     : c10d
  rdzv_endpoint    : 172.31.130.84:29503
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : /home/gridsan/rmehta/potential_function/multi_tutorial.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 1
  run_id           : 7644
  rdzv_backend     : c10d
  rdzv_endpoint    : 172.31.130.84:29503
  rdzv_configs     : {'timeout': 900}
  max_restarts     : 0
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /state/partition1/slurm_tmp/26645921.1.1/torchelastic_00qy3rwa/7644__t3nqnre
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
[W socket.cpp:426] [c10d] The server socket has failed to listen on [::]:29503 (errno: 98 - Address already in use).
[W socket.cpp:426] [c10d] The server socket has failed to bind to 0.0.0.0:29503 (errno: 98 - Address already in use).
[E socket.cpp:462] [c10d] The server socket has failed to listen on any local network address.
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /state/partition1/slurm_tmp/26645921.1.0/torchelastic_sxs5m6o3/7644_v74ll5_1
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3.9
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=d-9-11-1.supercloud.mit.edu
  master_port=58139
  group_rank=0
  group_world_size=2
  local_ranks=[0]
  role_ranks=[0]
  global_ranks=[0]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=d-9-11-1.supercloud.mit.edu
  master_port=58139
  group_rank=1
  group_world_size=2
  local_ranks=[0]
  role_ranks=[1]
  global_ranks=[1]
  role_world_sizes=[2]
  global_world_sizes=[2]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:Environment variable 'TORCHELASTIC_ENABLE_FILE_TIMER' not found. Do not start FileTimerServer.
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /state/partition1/slurm_tmp/26645921.1.1/torchelastic_00qy3rwa/7644__t3nqnre/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /state/partition1/slurm_tmp/26645921.1.0/torchelastic_sxs5m6o3/7644_v74ll5_1/attempt_0/0/error.json
d-9-11-1:2870757:2870757 [0] NCCL INFO Bootstrap : Using ens2f0:172.31.130.84<0>
d-9-11-1:2870757:2870757 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d-9-11-1:2870757:2870757 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.14.3+cuda11.6
d-9-11-1:2870756:2870756 [0] NCCL INFO cudaDriverVersion 12020
d-9-11-1:2870756:2870756 [0] NCCL INFO Bootstrap : Using ens2f0:172.31.130.84<0>
d-9-11-1:2870756:2870756 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
d-9-11-1:2870756:2870934 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB ens2f0:172.31.130.84<0>
d-9-11-1:2870756:2870934 [0] NCCL INFO Using network IB
d-9-11-1:2870757:2870933 [0] NCCL INFO NET/IB : Using [0]mlx5_0:1/RoCE [RO]; OOB ens2f0:172.31.130.84<0>
d-9-11-1:2870757:2870933 [0] NCCL INFO Using network IB

d-9-11-1:2870757:2870933 [0] init.cc:525 NCCL WARN Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 86000
d-9-11-1:2870757:2870933 [0] NCCL INFO init.cc:1089 -> 5
d-9-11-1:2870757:2870933 [0] NCCL INFO group.cc:64 -> 5 [Async thread]

d-9-11-1:2870756:2870934 [0] init.cc:525 NCCL WARN Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 86000
d-9-11-1:2870756:2870934 [0] NCCL INFO init.cc:1089 -> 5
d-9-11-1:2870756:2870934 [0] NCCL INFO group.cc:64 -> 5 [Async thread]
d-9-11-1:2870756:2870756 [0] NCCL INFO group.cc:421 -> 3
d-9-11-1:2870756:2870756 [0] NCCL INFO group.cc:106 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO group.cc:421 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO group.cc:106 -> 3
d-9-11-1:2870757:2870757 [0] NCCL INFO comm 0x560c6a8aafd0 rank 0 nranks 2 cudaDev 0 busId 86000 - Abort COMPLETE
d-9-11-1:2870756:2870756 [0] NCCL INFO comm 0x55592676f080 rank 1 nranks 2 cudaDev 0 busId 86000 - Abort COMPLETE
Traceback (most recent call last):
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 113, in <module>
Traceback (most recent call last):
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 113, in <module>
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 100, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 39, in __init__
    self.model = DDP(self.model, device_ids=[self.local_rank])
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    main(args.save_every, args.total_epochs, args.batch_size)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 100, in main
    trainer = Trainer(model, train_data, optimizer, save_every, snapshot_path)
  File "/home/gridsan/rmehta/potential_function/multi_tutorial.py", line 39, in __init__
    self.model = DDP(self.model, device_ids=[self.local_rank])
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 655, in __init__
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    _verify_param_shape_across_processes(self.process_group, parameters)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/utils.py", line 112, in _verify_param_shape_across_processes
    return dist._verify_params_across_processes(process_group, tensors, logger)
    return dist._verify_params_across_processes(process_group, tensors, logger)
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 1 and rank 0 both on CUDA device 86000
RuntimeError: NCCL error in: ../torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1269, internal error, NCCL version 2.14.3
ncclInternalError: Internal check failed.
Last error:
Duplicate GPU detected : rank 0 and rank 1 both on CUDA device 86000
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2870757) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/python3.9
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 2870756) of binary: /state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/python3.9
INFO:torch.distributed.elastic.agent.server.api:Local worker group finished (FAILED). Waiting 300 seconds for other agents to finish
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.00014710426330566406 seconds
INFO:torch.distributed.elastic.agent.server.api:Done waiting for other agents. Elapsed: 0.0004813671112060547 seconds
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 0 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
INFO:torch.distributed.elastic.multiprocessing.errors:local_rank 1 FAILED with no error file. Decorate your entrypoint fn with @record for traceback info. See: https://pytorch.org/docs/stable/elastic/errors.html
Traceback (most recent call last):
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/torchrun", line 8, in <module>
Traceback (most recent call last):
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/bin/torchrun", line 8, in <module>
    sys.exit(main())
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    sys.exit(main())
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 346, in wrapper
    return f(*args, **kwargs)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    return f(*args, **kwargs)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 762, in main
    run(args)
    run(args)
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/run.py", line 753, in run
    elastic_launch(
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    elastic_launch(
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 132, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    return launch_agent(self._config, self._entrypoint, list(args))
  File "/state/partition1/llgrid/pkg/anaconda/anaconda3-2023a/lib/python3.9/site-packages/torch/distributed/launcher/api.py", line 246, in launch_agent
    raise ChildFailedError(
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/gridsan/rmehta/potential_function/multi_tutorial.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-24_11:39:43
  host      : d-9-11-1.supercloud.mit.edu
  rank      : 0 (local_rank: 0)
  exitcode  : 1 (pid: 2870757)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html
============================================================
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
/home/gridsan/rmehta/potential_function/multi_tutorial.py FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2024-07-24_11:39:43
  host      : d-9-11-1.supercloud.mit.edu
  rank      : 1 (local_rank: 0)
  exitcode  : 1 (pid: 2870756)
  error_file: <N/A>
  traceback : To enable traceback see: https://pytorch.org/docs/stable/elastic/errors.html

I am unsure whether the root problem is the failure to bind the port, with the duplicate-GPU error being a downstream consequence, or whether these are two separate errors. I have tried many different ports and the bind fails every time. I would appreciate any help on what I might be doing wrong here. Thank you!

From the log, it looks like port 29503 is already in use. You might need to kill any "zombie" processes that are still holding onto the port. You can find the PID with lsof, e.g. lsof -i :29503, and then kill the process, e.g. kill -9 <pid>, as sketched below.
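Roughly, the cleanup could look like this on the affected node (a minimal sketch; the port number comes from your script, and the pkill line is just one convenient way to sweep up leftover torchrun processes owned by your user):

# show which process is listening on the rendezvous port
lsof -i :29503

# kill that specific process once you know its PID
kill -9 <pid>

# or, more bluntly, kill any leftover torchrun processes you own on this node
pkill -9 -u $USER -f torchrun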

Also, what happens if you try --node_rank, --master_addr, and --master_port instead of the --rdzv_* options? A rough sketch of that variant follows.
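Something along these lines in the submission script, for example (an untested sketch: it assumes srun runs exactly one task per node, uses $SLURM_NODEID as the node rank, carries over port 29503 from your script, and introduces MASTER_ADDR/MASTER_PORT purely as helper variables; the single quotes are there so $SLURM_NODEID is expanded inside each task rather than once in the batch script):

# helper variables, exported so each srun task sees them
export MASTER_ADDR=$head_node_ip
export MASTER_PORT=29503

# one torchrun agent per node; node rank taken from SLURM inside each task
srun --ntasks-per-node=1 bash -c 'torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --node_rank $SLURM_NODEID \
    --master_addr $MASTER_ADDR \
    --master_port $MASTER_PORT \
    multi_tutorial.py 50 10'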