Can I have a different number of GPUs on different nodes while running distributed tasks?

Sadly, I have 2 nodes, one with 3 GPUs and another with 2 GPUs, and I have failed to run distributed training across all of them.

What I have tried:

  • with --nnodes=2 --nproc_per_node=3 on one node and --nnodes=2 --nproc_per_node=2 on the other.
    PyTorch seems to support this setup: the program successfully rendezvoused with global_world_sizes = [5, 5, 5] ([5, 5] on the other node), my training starts and then hangs forever (before the dataloader, possibly on a barrier).
log(--nnodes=2 --nproc_per_node=x) first node
/home/train/.local/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : STIP_main.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 3
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.12.0.2:30001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/train/.local/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.12.0.2
  master_port=30001
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1, 2]
  role_ranks=[0, 1, 2]
  global_ranks=[0, 1, 2]
  role_world_sizes=[5, 5, 5]
  global_world_sizes=[5, 5, 5]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt/attempt_0/2/error.json
| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 1): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node0:1516058:1516058 [0] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1516058:1516058 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Node0:1516058:1516058 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1516058:1516058 [0] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1516058:1516058 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
Node0:1516059:1516059 [1] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1516059:1516059 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Node0:1516059:1516059 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1516059:1516059 [1] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1516059:1516059 [1] NCCL INFO Using network Socket
Node0:1516060:1516060 [2] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1516060:1516060 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Node0:1516060:1516060 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1516060:1516060 [2] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1516060:1516060 [2] NCCL INFO Using network Socket
Node0:1516060:1516101 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1516059:1516100 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1516060:1516101 [2] NCCL INFO Trees [0] -1/-1/-1->2->1|1->2->-1/-1/-1 [1] -1/-1/-1->2->1|1->2->-1/-1/-1
Node0:1516059:1516100 [1] NCCL INFO Trees [0] 2/3/-1->1->0|0->1->2/3/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
Node0:1516058:1516099 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4
Node0:1516058:1516099 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4
Node0:1516058:1516099 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1516058:1516099 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->4|4->0->1/-1/-1
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516058:1516099 [0] NCCL INFO Channel 00 : 4[1000] -> 0[1000] [receive] via NET/Socket/0
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516058:1516099 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via direct shared memory
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 1[2000] -> 2[4000] via direct shared memory
Node0:1516060:1516101 [2] NCCL INFO Channel 00 : 2[4000] -> 3[2000] [send] via NET/Socket/0
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516060:1516101 [2] NCCL INFO Channel 00 : 2[4000] -> 1[2000] via direct shared memory
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 3[2000] -> 1[2000] [receive] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 1[2000] -> 0[1000] via direct shared memory
Node0:1516058:1516099 [0] NCCL INFO Channel 01 : 4[1000] -> 0[1000] [receive] via NET/Socket/0
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516058:1516099 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2000] via direct shared memory
Node0:1516060:1516101 [2] NCCL INFO Channel 01 : 2[4000] -> 3[2000] [send] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 1[2000] -> 3[2000] [send] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516059:1516100 [1] NCCL INFO Channel 01 : 1[2000] -> 2[4000] via direct shared memory
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516060:1516101 [2] NCCL INFO Channel 01 : 2[4000] -> 1[2000] via direct shared memory
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516058:1516099 [0] NCCL INFO Channel 01 : 0[1000] -> 4[1000] [send] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Channel 01 : 1[2000] -> 0[1000] via direct shared memory
Node0:1516058:1516099 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1516058:1516099 [0] NCCL INFO comm 0x7f3d64002e10 rank 0 nranks 5 cudaDev 0 busId 1000 - Init COMPLETE
Node0:1516058:1516058 [0] NCCL INFO Launch mode Parallel
Node0:1516060:1516101 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1516060:1516101 [2] NCCL INFO comm 0x7fd37c002e10 rank 2 nranks 5 cudaDev 2 busId 4000 - Init COMPLETE
Node0:1516059:1516100 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1516059:1516100 [1] NCCL INFO comm 0x7f18c8002e10 rank 1 nranks 5 cudaDev 1 busId 2000 - Init COMPLETE
loading annotations into memory...
Done (t=1.09s)
creating index...
index created!
loading annotations into memory...
Done (t=1.05s)
creating index...
index created!

[Logger] DETR Arguments:
	lr: 5e-05
	lr_backbone: 1e-05
	lr_drop: 80
	frozen_weights: None
	backbone: resnet50
	dilation: False
	position_embedding: sine
	enc_layers: 6
	dec_layers: 6
	num_queries: 100
	dataset_file: vcoco

[Logger] DETR_HOI Arguments:
	hoi_dec_layers: 6
	hoi_nheads: 8
	hoi_dim_feedforward: 2048
	hoi_idx_loss_coef: 1
	hoi_act_loss_coef: 1
	hoi_eos_coef: 0.1
	object_threshold: 0

[Logger] Number of total params:  56240935

[Logger] Number of trainable params:  14716167
Loading detr weights from args.detr_weights=pretrained/detr-r50-e632da11.pth

log(--nnodes=2 --nproc_per_node=x) second node
/home/train/.local/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : STIP_main.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 2
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.12.0.2:30001
  rdzv_configs     : {'rank': 1, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_oohtu0i7/none_d_b43gt4
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/train/.local/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.12.0.2
  master_port=30001
  group_rank=1
  group_world_size=2
  local_ranks=[0, 1]
  role_ranks=[3, 4]
  global_ranks=[3, 4]
  role_world_sizes=[5, 5]
  global_world_sizes=[5, 5]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_oohtu0i7/none_d_b43gt4/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_oohtu0i7/none_d_b43gt4/attempt_0/1/error.json
| distributed init (rank 3): env://
| distributed init (rank 4): env://
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node1:3786256:3786256 [1] NCCL INFO Bootstrap : Using [0]enp5s0:10.12.0.3<0>
Node1:3786257:3786257 [0] NCCL INFO Bootstrap : Using [0]enp5s0:10.12.0.3<0>
Node1:3786256:3786256 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node1:3786257:3786257 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node1:3786256:3786256 [1] NCCL INFO NET/IB : No device found.
Node1:3786257:3786257 [0] NCCL INFO NET/IB : No device found.
Node1:3786257:3786257 [0] NCCL INFO NET/Socket : Using [0]enp5s0:10.12.0.3<0>
Node1:3786256:3786256 [1] NCCL INFO NET/Socket : Using [0]enp5s0:10.12.0.3<0>
Node1:3786257:3786257 [0] NCCL INFO Using network Socket
Node1:3786256:3786256 [1] NCCL INFO Using network Socket
Node1:3786257:3786293 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node1:3786257:3786293 [0] NCCL INFO Trees [0] -1/-1/-1->4->3|3->4->-1/-1/-1 [1] 0/-1/-1->4->3|3->4->0/-1/-1
Node1:3786256:3786292 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node1:3786256:3786292 [1] NCCL INFO Trees [0] 4/-1/-1->3->1|1->3->4/-1/-1 [1] 4/-1/-1->3->-1|-1->3->4/-1/-1
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 2[4000] -> 3[2000] [receive] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 3[2000] -> 4[1000] via direct shared memory
Node1:3786257:3786293 [0] NCCL INFO Channel 00 : 4[1000] -> 0[1000] [send] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786257:3786293 [0] NCCL INFO Channel 00 : 4[1000] -> 3[2000] via direct shared memory
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 3[2000] -> 1[2000] [send] via NET/Socket/0
Node1:3786257:3786293 [0] NCCL INFO Channel 01 : 4[1000] -> 0[1000] [send] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 1[2000] -> 3[2000] [receive] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Channel 01 : 2[4000] -> 3[2000] [receive] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786256:3786292 [1] NCCL INFO Channel 01 : 3[2000] -> 4[1000] via direct shared memory
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786257:3786293 [0] NCCL INFO Channel 01 : 0[1000] -> 4[1000] [receive] via NET/Socket/0
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786257:3786293 [0] NCCL INFO Channel 01 : 4[1000] -> 3[2000] via direct shared memory
Node1:3786256:3786292 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node1:3786256:3786292 [1] NCCL INFO comm 0x7f0804002e10 rank 3 nranks 5 cudaDev 1 busId 2000 - Init COMPLETE
Node1:3786257:3786293 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node1:3786257:3786293 [0] NCCL INFO comm 0x7fe758002e10 rank 4 nranks 5 cudaDev 0 busId 1000 - Init COMPLETE



  • with --nnodes=5 --nproc_per_node=1 on both nodes, running 3 instances on the first node and 2 instances on the other.
    In this setup, the program throws an exception and hangs; my program never starts.
log(--nnodes=5 --nproc_per_node=1)
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qmf3r7x5/none_wf___urt/attempt_0/0/error.json
| distributed init (rank 0): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node0:1515236:1515236 [0] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1515236:1515236 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Node0:1515236:1515236 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1515236:1515236 [0] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1515236:1515236 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
Node0:1515236:1515389 [0] NCCL INFO Channel 00/02 :    0   1   2   3   4
Node0:1515236:1515389 [0] NCCL INFO Channel 01/02 :    0   1   2   3   4
Node0:1515236:1515389 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1515236:1515389 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->4|4->0->1/-1/-1
Node0:1515236:1515389 [0] NCCL INFO Channel 00 : 4[2000] -> 0[1000] [receive] via NET/Socket/0
Node0:1515236:1515389 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via P2P/IPC
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying

Node0:1515236:1515389 [0] include/socket.h:403 NCCL WARN Connect to 10.12.0.2<58663> failed : Connection refused
Node0:1515236:1515389 [0] NCCL INFO bootstrap.cc:95 -> 2
Node0:1515236:1515389 [0] NCCL INFO bootstrap.cc:363 -> 2
Node0:1515236:1515389 [0] NCCL INFO transport.cc:59 -> 2
Node0:1515236:1515389 [0] NCCL INFO init.cc:766 -> 2
Node0:1515236:1515389 [0] NCCL INFO init.cc:840 -> 2
Node0:1515236:1515389 [0] NCCL INFO group.cc:73 -> 2 [Async thread]

Traceback (most recent call last):
  File "STIP_main.py", line 300, in <module>
    main(args)
  File "STIP_main.py", line 43, in main
    utils.init_distributed_mode(args)
  File "/home/train/projects/python/stip/src/util/misc.py", line 304, in init_distributed_mode
    torch.distributed.barrier()
  File "/home/train/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2524, in barrier
    work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.

  • no problem with --nnodes=2 --nproc_per_node=2, although this wastes 1 GPU.
log(--nnodes=2 --nproc_per_node=2)
/home/train/.local/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
  logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
  entrypoint       : STIP_main.py
  min_nodes        : 2
  max_nodes        : 2
  nproc_per_node   : 2
  run_id           : none
  rdzv_backend     : static
  rdzv_endpoint    : 10.12.0.2:30001
  rdzv_configs     : {'rank': 0, 'timeout': 900}
  max_restarts     : 3
  monitor_interval : 5
  log_dir          : None
  metrics_cfg      : {}

INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_x1c0_2bl/none_tv20nqe1
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/train/.local/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
  warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
  restart_count=0
  master_addr=10.12.0.2
  master_port=30001
  group_rank=0
  group_world_size=2
  local_ranks=[0, 1]
  role_ranks=[0, 1]
  global_ranks=[0, 1]
  role_world_sizes=[4, 4]
  global_world_sizes=[4, 4]

INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x1c0_2bl/none_tv20nqe1/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x1c0_2bl/none_tv20nqe1/attempt_0/1/error.json
| distributed init (rank 1): env://
| distributed init (rank 0): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node0:1515795:1515795 [0] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1515795:1515795 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Node0:1515795:1515795 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1515795:1515795 [0] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1515795:1515795 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
Node0:1515796:1515796 [1] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1515796:1515796 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

Node0:1515796:1515796 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1515796:1515796 [1] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1515796:1515796 [1] NCCL INFO Using network Socket
Node0:1515796:1515826 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
Node0:1515796:1515826 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
Node0:1515795:1515825 [0] NCCL INFO Channel 00/02 :    0   1   2   3
Node0:1515795:1515825 [0] NCCL INFO Channel 01/02 :    0   1   2   3
Node0:1515795:1515825 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
Node0:1515795:1515825 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->3|3->0->1/-1/-1
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515795:1515825 [0] NCCL INFO Channel 00 : 3[2000] -> 0[1000] [receive] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515795:1515825 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via direct shared memory
Node0:1515796:1515826 [1] NCCL INFO Channel 00 : 1[2000] -> 2[1000] [send] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515796:1515826 [1] NCCL INFO Channel 00 : 2[1000] -> 1[2000] [receive] via NET/Socket/0
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515796:1515826 [1] NCCL INFO Channel 00 : 1[2000] -> 0[1000] via direct shared memory
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515795:1515825 [0] NCCL INFO Channel 01 : 3[2000] -> 0[1000] [receive] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515795:1515825 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2000] via direct shared memory
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515796:1515826 [1] NCCL INFO Channel 01 : 1[2000] -> 2[1000] [send] via NET/Socket/0
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515796:1515826 [1] NCCL INFO Channel 01 : 1[2000] -> 0[1000] via direct shared memory
Node0:1515796:1515826 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1515796:1515826 [1] NCCL INFO comm 0x7faa90002e10 rank 1 nranks 4 cudaDev 1 busId 2000 - Init COMPLETE
Node0:1515795:1515825 [0] NCCL INFO Channel 01 : 0[1000] -> 3[2000] [send] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1515795:1515825 [0] NCCL INFO comm 0x7f5054002e10 rank 0 nranks 4 cudaDev 0 busId 1000 - Init COMPLETE
Node0:1515795:1515795 [0] NCCL INFO Launch mode Parallel
loading annotations into memory...
Done (t=1.01s)
creating index...
index created!
loading annotations into memory...
Done (t=1.02s)
creating index...
index created!

[Logger] DETR Arguments:
	lr: 5e-05
	lr_backbone: 1e-05
	lr_drop: 80
	frozen_weights: None
	backbone: resnet50
	dilation: False
	position_embedding: sine
	enc_layers: 6
	dec_layers: 6
	num_queries: 100
	dataset_file: vcoco

[Logger] DETR_HOI Arguments:
	hoi_dec_layers: 6
	hoi_nheads: 8
	hoi_dim_feedforward: 2048
	hoi_idx_loss_coef: 1
	hoi_act_loss_coef: 1
	hoi_eos_coef: 0.1
	object_threshold: 0

[Logger] Number of total params:  56240935

[Logger] Number of trainable params:  14716167
Loading detr weights from args.detr_weights=pretrained/detr-r50-e632da11.pth

>>> Epoch #1

  • no problem with --nnodes=1 --nproc_per_node=3, although this wastes 1 node (2 GPUs).

@d4l3k could you advise on the recommended way to launch this program with 3 GPUs on one node and 2 GPUs on the other?

I encountered the same issue before. Here is my solution:

  1. Launch one process per node using torchrun.
  2. Initialize a first, global process group; have each process collect the number of available CUDA devices on its node with torch.cuda.device_count(), and all-gather that device information across all nodes.
  3. Destroy that process group. On each node, spawn as many new processes as the node has CUDA devices, carefully computing the new RANK and WORLD_SIZE for each. For example, if node 1 has 2 GPUs and node 2 has 3 GPUs, the RANKs should be 0, 1 on node 1 and 2, 3, 4 on node 2, and the total WORLD_SIZE is 5. Then initialize the final process group. A minimal sketch of this idea follows below.
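
Roughly, the whole thing fits in one entrypoint. This is an untested sketch of the idea, not a drop-in script; train_worker, the hard-coded second port, and the gloo bootstrap group are placeholders of my own:

import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def train_worker(local_rank, rank_offset, world_size, master_addr, master_port):
    # Final process group: one process per GPU across all nodes.
    os.environ["MASTER_ADDR"] = master_addr
    os.environ["MASTER_PORT"] = str(master_port)
    rank = rank_offset + local_rank
    torch.cuda.set_device(local_rank)
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    dist.barrier(device_ids=[local_rank])
    # ... your actual training goes here ...
    dist.destroy_process_group()


def main():
    # Phase 1: launched with `torchrun --nnodes=<N> --nproc_per_node=1 ...`,
    # so torchrun sets RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT and this
    # bootstrap group has exactly one member per node.
    dist.init_process_group("gloo")
    node_rank = dist.get_rank()
    num_nodes = dist.get_world_size()

    # Phase 2: every node shares how many GPUs it has.
    gpu_counts = [None] * num_nodes
    dist.all_gather_object(gpu_counts, torch.cuda.device_count())

    world_size = sum(gpu_counts)
    rank_offset = sum(gpu_counts[:node_rank])  # first global rank on this node
    master_addr = os.environ["MASTER_ADDR"]
    second_port = 29501  # placeholder: any free port reachable from all nodes

    # Phase 3: drop the bootstrap group and spawn one worker per local GPU.
    dist.destroy_process_group()
    mp.spawn(
        train_worker,
        args=(rank_offset, world_size, master_addr, second_port),
        nprocs=gpu_counts[node_rank],
    )


if __name__ == "__main__":
    main()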

@Levi_Ackerman Could you share a runnable script? I have the same problem, but it's hard for me to reproduce your solution. Thank you very much!

@imkzh It seems like there is something in your training script that assumes a homogeneous number of procs per node. Using a simple program that uses PyTorch distributed to compute the world size by all-reducing a one-hot vector by rank (see the program here: torchx/torchx/examples/apps/compute_world_size at main · pytorch/torchx · GitHub), I've been able to validate that a heterogeneous nproc_per_node works when using torchrun. A minimal rendition of that check is sketched just below.
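
For reference, the gist of the check is tiny. This is not the exact torchx code (which is Hydra-configured, hence the main.backend=nccl override in the commands further down); here a BACKEND environment variable of my own stands in for that:

import os

import torch
import torch.distributed as dist


def main():
    backend = os.environ.get("BACKEND", "gloo")
    dist.init_process_group(backend)  # torchrun supplies RANK/WORLD_SIZE/MASTER_*

    rank = dist.get_rank()
    world_size = dist.get_world_size()

    # One-hot vector by rank: after a SUM all-reduce it is all ones,
    # so its sum equals the number of participating ranks.
    t = torch.zeros(world_size)
    if backend == "nccl":
        torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
        t = t.cuda()
    t[rank] = 1.0
    dist.all_reduce(t)
    computed_world_size = int(t.sum().item())

    print(f"rank: {rank}, actual world_size: {world_size}, "
          f"computed world_size: {computed_world_size}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()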

Running 2 nodes, 1 proc on the first and 2 procs on the second for a total world size of 3

$ LOGLEVEL=INFO torchrun \
    --nnodes 2 \
    --nproc_per_node 1 \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:29500 \
    main.py

$ LOGLEVEL=INFO torchrun \
    --nnodes 2 \
    --nproc_per_node 2 \
    --rdzv_backend c10d \
    --rdzv_endpoint localhost:29500 \
    main.py

Running the two commands above yields an output like so:

rank: 1, actual world_size: 3, computed world_size: 3
rank: 2, actual world_size: 3, computed world_size: 3
------------------------------------------------------
rank: 0, actual world_size: 3, computed world_size: 3

The above works equally well with GPUs (e.g. when the backend is nccl):

$ CUDA_VISIBLE_DEVICES=2 LOGLEVEL=INFO torchrun \
  --nnodes 2 \
  --nproc_per_node 1 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  main.py main.backend=nccl

$ CUDA_VISIBLE_DEVICES=0,1 LOGLEVEL=INFO torchrun \
  --nnodes 2 \
  --nproc_per_node 2 \
  --rdzv_backend c10d \
  --rdzv_endpoint localhost:29500 \
  main.py main.backend=nccl
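
Separately, and independent of which launcher you use: the [W ProcessGroupNCCL.cpp] warnings in your logs say the rank-to-GPU mapping is unknown when the barrier runs, which is exactly the situation they warn can hang. It's worth pinning each process to its local GPU and passing device_ids to barrier() in your init_distributed_mode. A sketch of the relevant part (assuming LOCAL_RANK is set, i.e. torchrun or launch with --use_env; your actual util/misc.py may differ):

import os

import torch
import torch.distributed as dist


def init_distributed_mode(args):
    # torchrun (or torch.distributed.launch --use_env) exports these per worker.
    args.rank = int(os.environ["RANK"])
    args.world_size = int(os.environ["WORLD_SIZE"])
    args.gpu = int(os.environ["LOCAL_RANK"])

    # Pin this process to its GPU *before* any NCCL collective runs, so
    # nothing has to fall back to a best-guess rank-to-GPU mapping.
    torch.cuda.set_device(args.gpu)
    dist.init_process_group(backend="nccl", init_method="env://",
                            world_size=args.world_size, rank=args.rank)

    # Passing device_ids silences the best-guess warning and avoids the
    # potential hang it describes.
    dist.barrier(device_ids=[args.gpu])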