Sadly, I have 2 nodes, one with 3 GPUs and the other with 2 GPUs, and I have failed to run distributed training across all of them.
What I have tried:
- With `--nnodes=2 --nproc_per_node=3` on one node and `--nnodes=2 --nproc_per_node=2` on the other. PyTorch seems to support this setup: the program rendezvouses successfully with `global_world_sizes=[5, 5, 5]` (`[5, 5]` on the other node) and my training starts, but then it hangs forever, before the dataloader and possibly on a barrier (see the sketch after the two logs below).

Log (`--nnodes=2 --nproc_per_node=x`), first node:
/home/train/.local/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : STIP_main.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 3
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.12.0.2:30001
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/train/.local/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.12.0.2
master_port=30001
group_rank=0
group_world_size=2
local_ranks=[0, 1, 2]
role_ranks=[0, 1, 2]
global_ranks=[0, 1, 2]
role_world_sizes=[5, 5, 5]
global_world_sizes=[5, 5, 5]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt/attempt_0/1/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker2 reply file to: /tmp/torchelastic_d9xbyhht/none_o0pk7dkt/attempt_0/2/error.json
| distributed init (rank 0): env://
| distributed init (rank 2): env://
| distributed init (rank 1): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 2 using best-guess GPU 2 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node0:1516058:1516058 [0] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1516058:1516058 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node0:1516058:1516058 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1516058:1516058 [0] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1516058:1516058 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
Node0:1516059:1516059 [1] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1516059:1516059 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node0:1516059:1516059 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1516059:1516059 [1] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1516059:1516059 [1] NCCL INFO Using network Socket
Node0:1516060:1516060 [2] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1516060:1516060 [2] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node0:1516060:1516060 [2] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1516060:1516060 [2] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1516060:1516060 [2] NCCL INFO Using network Socket
Node0:1516060:1516101 [2] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1516059:1516100 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1516060:1516101 [2] NCCL INFO Trees [0] -1/-1/-1->2->1|1->2->-1/-1/-1 [1] -1/-1/-1->2->1|1->2->-1/-1/-1
Node0:1516059:1516100 [1] NCCL INFO Trees [0] 2/3/-1->1->0|0->1->2/3/-1 [1] 2/-1/-1->1->0|0->1->2/-1/-1
Node0:1516058:1516099 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4
Node0:1516058:1516099 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4
Node0:1516058:1516099 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1516058:1516099 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->4|4->0->1/-1/-1
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516058:1516099 [0] NCCL INFO Channel 00 : 4[1000] -> 0[1000] [receive] via NET/Socket/0
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516058:1516099 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via direct shared memory
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 1[2000] -> 2[4000] via direct shared memory
Node0:1516060:1516101 [2] NCCL INFO Channel 00 : 2[4000] -> 3[2000] [send] via NET/Socket/0
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516060:1516101 [2] NCCL INFO Channel 00 : 2[4000] -> 1[2000] via direct shared memory
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 3[2000] -> 1[2000] [receive] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 1[2000] -> 0[1000] via direct shared memory
Node0:1516058:1516099 [0] NCCL INFO Channel 01 : 4[1000] -> 0[1000] [receive] via NET/Socket/0
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516058:1516099 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2000] via direct shared memory
Node0:1516060:1516101 [2] NCCL INFO Channel 01 : 2[4000] -> 3[2000] [send] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Channel 00 : 1[2000] -> 3[2000] [send] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516059:1516100 [1] NCCL INFO Channel 01 : 1[2000] -> 2[4000] via direct shared memory
Node0:1516060:1516101 [2] NCCL INFO Could not enable P2P between dev 2(=4000) and dev 1(=2000)
Node0:1516060:1516101 [2] NCCL INFO Channel 01 : 2[4000] -> 1[2000] via direct shared memory
Node0:1516058:1516099 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 2(=4000)
Node0:1516059:1516100 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1516058:1516099 [0] NCCL INFO Channel 01 : 0[1000] -> 4[1000] [send] via NET/Socket/0
Node0:1516059:1516100 [1] NCCL INFO Channel 01 : 1[2000] -> 0[1000] via direct shared memory
Node0:1516058:1516099 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1516058:1516099 [0] NCCL INFO comm 0x7f3d64002e10 rank 0 nranks 5 cudaDev 0 busId 1000 - Init COMPLETE
Node0:1516058:1516058 [0] NCCL INFO Launch mode Parallel
Node0:1516060:1516101 [2] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1516060:1516101 [2] NCCL INFO comm 0x7fd37c002e10 rank 2 nranks 5 cudaDev 2 busId 4000 - Init COMPLETE
Node0:1516059:1516100 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1516059:1516100 [1] NCCL INFO comm 0x7f18c8002e10 rank 1 nranks 5 cudaDev 1 busId 2000 - Init COMPLETE
loading annotations into memory...
Done (t=1.09s)
creating index...
index created!
loading annotations into memory...
Done (t=1.05s)
creating index...
index created!
[Logger] DETR Arguments:
lr: 5e-05
lr_backbone: 1e-05
lr_drop: 80
frozen_weights: None
backbone: resnet50
dilation: False
position_embedding: sine
enc_layers: 6
dec_layers: 6
num_queries: 100
dataset_file: vcoco
[Logger] DETR_HOI Arguments:
hoi_dec_layers: 6
hoi_nheads: 8
hoi_dim_feedforward: 2048
hoi_idx_loss_coef: 1
hoi_act_loss_coef: 1
hoi_eos_coef: 0.1
object_threshold: 0
[Logger] Number of total params: 56240935
[Logger] Number of trainable params: 14716167
Loading detr weights from args.detr_weights=pretrained/detr-r50-e632da11.pth
Log (`--nnodes=2 --nproc_per_node=x`), second node:
/home/train/.local/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : STIP_main.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 2
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.12.0.2:30001
rdzv_configs : {'rank': 1, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_oohtu0i7/none_d_b43gt4
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/train/.local/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.12.0.2
master_port=30001
group_rank=1
group_world_size=2
local_ranks=[0, 1]
role_ranks=[3, 4]
global_ranks=[3, 4]
role_world_sizes=[5, 5]
global_world_sizes=[5, 5]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_oohtu0i7/none_d_b43gt4/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_oohtu0i7/none_d_b43gt4/attempt_0/1/error.json
| distributed init (rank 3): env://
| distributed init (rank 4): env://
[W ProcessGroupNCCL.cpp:1569] Rank 3 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 4 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node1:3786256:3786256 [1] NCCL INFO Bootstrap : Using [0]enp5s0:10.12.0.3<0>
Node1:3786257:3786257 [0] NCCL INFO Bootstrap : Using [0]enp5s0:10.12.0.3<0>
Node1:3786256:3786256 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node1:3786257:3786257 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node1:3786256:3786256 [1] NCCL INFO NET/IB : No device found.
Node1:3786257:3786257 [0] NCCL INFO NET/IB : No device found.
Node1:3786257:3786257 [0] NCCL INFO NET/Socket : Using [0]enp5s0:10.12.0.3<0>
Node1:3786256:3786256 [1] NCCL INFO NET/Socket : Using [0]enp5s0:10.12.0.3<0>
Node1:3786257:3786257 [0] NCCL INFO Using network Socket
Node1:3786256:3786256 [1] NCCL INFO Using network Socket
Node1:3786257:3786293 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node1:3786257:3786293 [0] NCCL INFO Trees [0] -1/-1/-1->4->3|3->4->-1/-1/-1 [1] 0/-1/-1->4->3|3->4->0/-1/-1
Node1:3786256:3786292 [1] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node1:3786256:3786292 [1] NCCL INFO Trees [0] 4/-1/-1->3->1|1->3->4/-1/-1 [1] 4/-1/-1->3->-1|-1->3->4/-1/-1
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 2[4000] -> 3[2000] [receive] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 3[2000] -> 4[1000] via direct shared memory
Node1:3786257:3786293 [0] NCCL INFO Channel 00 : 4[1000] -> 0[1000] [send] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786257:3786293 [0] NCCL INFO Channel 00 : 4[1000] -> 3[2000] via direct shared memory
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 3[2000] -> 1[2000] [send] via NET/Socket/0
Node1:3786257:3786293 [0] NCCL INFO Channel 01 : 4[1000] -> 0[1000] [send] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Channel 00 : 1[2000] -> 3[2000] [receive] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Channel 01 : 2[4000] -> 3[2000] [receive] via NET/Socket/0
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786256:3786292 [1] NCCL INFO Channel 01 : 3[2000] -> 4[1000] via direct shared memory
Node1:3786256:3786292 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node1:3786257:3786293 [0] NCCL INFO Channel 01 : 0[1000] -> 4[1000] [receive] via NET/Socket/0
Node1:3786257:3786293 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node1:3786257:3786293 [0] NCCL INFO Channel 01 : 4[1000] -> 3[2000] via direct shared memory
Node1:3786256:3786292 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node1:3786256:3786292 [1] NCCL INFO comm 0x7f0804002e10 rank 3 nranks 5 cudaDev 1 busId 2000 - Init COMPLETE
Node1:3786257:3786293 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node1:3786257:3786293 [0] NCCL INFO comm 0x7fe758002e10 rank 4 nranks 5 cudaDev 0 busId 1000 - Init COMPLETE
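Before the hang, every rank on both nodes prints the same ProcessGroupNCCL warning: the barrier runs on a best-guess GPU because the rank-to-device mapping is unknown, and the warning itself says this can hang if the guess is wrong (note that on the second node, rank 3 guesses GPU 1 and rank 4 guesses GPU 0). Here is a minimal sketch of what the warning asks for, assuming the launcher exports `LOCAL_RANK` (`torch.distributed.run` does; the deprecated `torch.distributed.launch` instead passes `--local_rank` as an argument unless started with `--use_env`):

```python
# Sketch only, not the project's actual util/misc.py: bind each worker to its
# own GPU before init, then pass device_ids to barrier() so NCCL stops guessing.
import os
import torch
import torch.distributed as dist

local_rank = int(os.environ["LOCAL_RANK"])  # assumption: launcher exports LOCAL_RANK
torch.cuda.set_device(local_rank)           # fix the rank -> GPU binding up front

dist.init_process_group(backend="nccl", init_method="env://")

# Explicit device for the barrier, per the ProcessGroupNCCL warning.
dist.barrier(device_ids=[local_rank])
```

Calling `torch.cuda.set_device` before `init_process_group` also keeps every rank from silently defaulting to GPU 0.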
- With `--nnodes=5 --nproc_per_node=1` on both nodes, running 3 instances on the first node and 2 on the second. In this setup the program throws an exception and hangs; my training never starts (see the launch sketch after the log below).

Log (`--nnodes=5 --nproc_per_node=1`):
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_qmf3r7x5/none_wf___urt/attempt_0/0/error.json
| distributed init (rank 0): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node0:1515236:1515236 [0] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1515236:1515236 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node0:1515236:1515236 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1515236:1515236 [0] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1515236:1515236 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
Node0:1515236:1515389 [0] NCCL INFO Channel 00/02 : 0 1 2 3 4
Node0:1515236:1515389 [0] NCCL INFO Channel 01/02 : 0 1 2 3 4
Node0:1515236:1515389 [0] NCCL INFO threadThresholds 8/8/64 | 40/8/64 | 8/8/64
Node0:1515236:1515389 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->4|4->0->1/-1/-1
Node0:1515236:1515389 [0] NCCL INFO Channel 00 : 4[2000] -> 0[1000] [receive] via NET/Socket/0
Node0:1515236:1515389 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via P2P/IPC
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] NCCL INFO Call to connect returned Connection refused, retrying
Node0:1515236:1515389 [0] include/socket.h:403 NCCL WARN Connect to 10.12.0.2<58663> failed : Connection refused
Node0:1515236:1515389 [0] NCCL INFO bootstrap.cc:95 -> 2
Node0:1515236:1515389 [0] NCCL INFO bootstrap.cc:363 -> 2
Node0:1515236:1515389 [0] NCCL INFO transport.cc:59 -> 2
Node0:1515236:1515389 [0] NCCL INFO init.cc:766 -> 2
Node0:1515236:1515389 [0] NCCL INFO init.cc:840 -> 2
Node0:1515236:1515389 [0] NCCL INFO group.cc:73 -> 2 [Async thread]
Traceback (most recent call last):
File "STIP_main.py", line 300, in <module>
main(args)
File "STIP_main.py", line 43, in main
utils.init_distributed_mode(args)
File "/home/train/projects/python/stip/src/util/misc.py", line 304, in init_distributed_mode
torch.distributed.barrier()
File "/home/train/.local/lib/python3.8/site-packages/torch/distributed/distributed_c10d.py", line 2524, in barrier
work = default_pg.barrier(opts=opts)
RuntimeError: NCCL error in: /pytorch/torch/lib/c10d/ProcessGroupNCCL.cpp:911, unhandled system error, NCCL version 2.7.8
ncclSystemError: System call (socket, malloc, munmap, etc) failed.
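Two details here may matter. First, the refused connection targets an ephemeral port on the master node (`10.12.0.2<58663>`, not the rendezvous port 30001); NCCL's bootstrap opens additional listening sockets beyond `master_port`, so a firewall that only allows 30001 between the machines would fail in exactly this way. Second, with one process per "node", each of the five instances needs its own `--node_rank`. A hedged sketch of how the three instances on the first machine might be started (script name, address, and port are copied from the logs; pinning each instance to one GPU via `CUDA_VISIBLE_DEVICES` is my assumption):

```python
# Hypothetical launcher for the first machine (node ranks 0..2); the second
# machine would do the same with node ranks 3 and 4.
import os
import subprocess
import sys

for node_rank, gpu in enumerate([0, 1, 2]):  # node ranks 0-2 live on this machine
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu))  # one GPU per instance
    subprocess.Popen(
        [
            sys.executable, "-m", "torch.distributed.launch",
            "--nnodes=5", "--nproc_per_node=1", f"--node_rank={node_rank}",
            "--master_addr=10.12.0.2", "--master_port=30001",
            "STIP_main.py",
        ],
        env=env,
    )
```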
- No problem with `--nnodes=2 --nproc_per_node=2`, though it wastes 1 GPU.

Log (`--nnodes=2 --nproc_per_node=2`):
/home/train/.local/lib/python3.8/site-packages/torch/distributed/launch.py:163: DeprecationWarning: The 'warn' method is deprecated, use 'warning' instead
logger.warn(
The module torch.distributed.launch is deprecated and going to be removed in future.Migrate to torch.distributed.run
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
INFO:torch.distributed.launcher.api:Starting elastic_operator with launch configs:
entrypoint : STIP_main.py
min_nodes : 2
max_nodes : 2
nproc_per_node : 2
run_id : none
rdzv_backend : static
rdzv_endpoint : 10.12.0.2:30001
rdzv_configs : {'rank': 0, 'timeout': 900}
max_restarts : 3
monitor_interval : 5
log_dir : None
metrics_cfg : {}
INFO:torch.distributed.elastic.agent.server.local_elastic_agent:log directory set to: /tmp/torchelastic_x1c0_2bl/none_tv20nqe1
INFO:torch.distributed.elastic.agent.server.api:[default] starting workers for entrypoint: python3
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous'ing worker group
/home/train/.local/lib/python3.8/site-packages/torch/distributed/elastic/utils/store.py:52: FutureWarning: This is an experimental API and will be changed in future.
warnings.warn(
INFO:torch.distributed.elastic.agent.server.api:[default] Rendezvous complete for workers. Result:
restart_count=0
master_addr=10.12.0.2
master_port=30001
group_rank=0
group_world_size=2
local_ranks=[0, 1]
role_ranks=[0, 1]
global_ranks=[0, 1]
role_world_sizes=[4, 4]
global_world_sizes=[4, 4]
INFO:torch.distributed.elastic.agent.server.api:[default] Starting worker group
INFO:torch.distributed.elastic.multiprocessing:Setting worker0 reply file to: /tmp/torchelastic_x1c0_2bl/none_tv20nqe1/attempt_0/0/error.json
INFO:torch.distributed.elastic.multiprocessing:Setting worker1 reply file to: /tmp/torchelastic_x1c0_2bl/none_tv20nqe1/attempt_0/1/error.json
| distributed init (rank 1): env://
| distributed init (rank 0): env://
[W ProcessGroupNCCL.cpp:1569] Rank 0 using best-guess GPU 0 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
[W ProcessGroupNCCL.cpp:1569] Rank 1 using best-guess GPU 1 to perform barrier as devices used by this process are currently unknown. This can potentially cause a hang if this rank to GPU mapping is incorrect.Specify device_ids in barrier() to force use of a particular device.
Node0:1515795:1515795 [0] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1515795:1515795 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node0:1515795:1515795 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1515795:1515795 [0] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1515795:1515795 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda11.1
Node0:1515796:1515796 [1] NCCL INFO Bootstrap : Using [0]enp6s0:10.12.0.2<0>
Node0:1515796:1515796 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
Node0:1515796:1515796 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
Node0:1515796:1515796 [1] NCCL INFO NET/Socket : Using [0]enp6s0:10.12.0.2<0>
Node0:1515796:1515796 [1] NCCL INFO Using network Socket
Node0:1515796:1515826 [1] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
Node0:1515796:1515826 [1] NCCL INFO Trees [0] 2/-1/-1->1->0|0->1->2/-1/-1 [1] -1/-1/-1->1->0|0->1->-1/-1/-1
Node0:1515795:1515825 [0] NCCL INFO Channel 00/02 : 0 1 2 3
Node0:1515795:1515825 [0] NCCL INFO Channel 01/02 : 0 1 2 3
Node0:1515795:1515825 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 8/8/64
Node0:1515795:1515825 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1|-1->0->1/-1/-1 [1] 1/-1/-1->0->3|3->0->1/-1/-1
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515795:1515825 [0] NCCL INFO Channel 00 : 3[2000] -> 0[1000] [receive] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515795:1515825 [0] NCCL INFO Channel 00 : 0[1000] -> 1[2000] via direct shared memory
Node0:1515796:1515826 [1] NCCL INFO Channel 00 : 1[2000] -> 2[1000] [send] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515796:1515826 [1] NCCL INFO Channel 00 : 2[1000] -> 1[2000] [receive] via NET/Socket/0
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515796:1515826 [1] NCCL INFO Channel 00 : 1[2000] -> 0[1000] via direct shared memory
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515795:1515825 [0] NCCL INFO Channel 01 : 3[2000] -> 0[1000] [receive] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515795:1515825 [0] NCCL INFO Channel 01 : 0[1000] -> 1[2000] via direct shared memory
Node0:1515795:1515825 [0] NCCL INFO Could not enable P2P between dev 0(=1000) and dev 1(=2000)
Node0:1515796:1515826 [1] NCCL INFO Channel 01 : 1[2000] -> 2[1000] [send] via NET/Socket/0
Node0:1515796:1515826 [1] NCCL INFO Could not enable P2P between dev 1(=2000) and dev 0(=1000)
Node0:1515796:1515826 [1] NCCL INFO Channel 01 : 1[2000] -> 0[1000] via direct shared memory
Node0:1515796:1515826 [1] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1515796:1515826 [1] NCCL INFO comm 0x7faa90002e10 rank 1 nranks 4 cudaDev 1 busId 2000 - Init COMPLETE
Node0:1515795:1515825 [0] NCCL INFO Channel 01 : 0[1000] -> 3[2000] [send] via NET/Socket/0
Node0:1515795:1515825 [0] NCCL INFO 2 coll channels, 2 p2p channels, 1 p2p channels per peer
Node0:1515795:1515825 [0] NCCL INFO comm 0x7f5054002e10 rank 0 nranks 4 cudaDev 0 busId 1000 - Init COMPLETE
Node0:1515795:1515795 [0] NCCL INFO Launch mode Parallel
loading annotations into memory...
Done (t=1.01s)
creating index...
index created!
loading annotations into memory...
Done (t=1.02s)
creating index...
index created!
[Logger] DETR Arguments:
lr: 5e-05
lr_backbone: 1e-05
lr_drop: 80
frozen_weights: None
backbone: resnet50
dilation: False
position_embedding: sine
enc_layers: 6
dec_layers: 6
num_queries: 100
dataset_file: vcoco
[Logger] DETR_HOI Arguments:
hoi_dec_layers: 6
hoi_nheads: 8
hoi_dim_feedforward: 2048
hoi_idx_loss_coef: 1
hoi_act_loss_coef: 1
hoi_eos_coef: 0.1
object_threshold: 0
[Logger] Number of total params: 56240935
[Logger] Number of trainable params: 14716167
Loading detr weights from args.detr_weights=pretrained/detr-r50-e632da11.pth
>>> Epoch #1
- No problem with `--nnodes=1 --nproc_per_node=3`, though it wastes a whole node (2 GPUs).
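For what it's worth, the asymmetric 3+2 layout can also be driven without the launcher at all: start one process per GPU by hand and give `env://` everything it needs. A sketch, under the assumption that whatever starts each process exports `RANK` and `LOCAL_RANK` (all names here are illustrative, not from my actual code):

```python
# Sketch: manual env:// init for the 3+2 layout, one process per GPU
# (ranks 0-2 on the 3-GPU node, ranks 3-4 on the 2-GPU node).
import os
import torch
import torch.distributed as dist

rank = int(os.environ["RANK"])              # assumption: exported by the starter
local_rank = int(os.environ["LOCAL_RANK"])  # 0-2 on node 0, 0-1 on node 1
os.environ.setdefault("MASTER_ADDR", "10.12.0.2")
os.environ.setdefault("MASTER_PORT", "30001")

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", init_method="env://",
                        world_size=5, rank=rank)
dist.barrier(device_ids=[local_rank])
```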