I tried to train MNIST using torch.distributed.launch with the nccl backend.
The launch command:
export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=true # setting this or not does not change the result
echo "NCCL_IB_DISABLE=$NCCL_IB_DISABLE"
export NCCL_SOCKET_IFNAME=eno1,eth0 # setting this or not does not change the result
python3 -m torch.distributed.launch --nproc_per_node 2 \
--nnodes 1 \
--node_rank 0 \
--master_addr="0.0.0.0" \
--master_port=2333 \
main.py \
--epochs 3 \
--lr 1e-3 \
--batch_size 150
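For reference, main.py follows the usual torch.distributed.launch pattern, roughly like this (a minimal sketch, not the exact file: the model and training-loop details are placeholders, --local_rank is the argument that launch injects into each process, and the other flags match the launch command above):

import argparse
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
    parser.add_argument("--epochs", type=int, default=3)
    parser.add_argument("--lr", type=float, default=1e-3)
    parser.add_argument("--batch_size", type=int, default=150)
    args = parser.parse_args()

    # Each process binds to its own GPU and joins the process group.
    # MASTER_ADDR / MASTER_PORT are set by the launcher, so env:// init is used.
    torch.cuda.set_device(args.local_rank)
    dist.init_process_group(backend="nccl")  # switching this to "gloo" makes the run work

    # Placeholder MNIST model; the real model and data loading are omitted here.
    model = nn.Sequential(nn.Flatten(), nn.Linear(784, 10)).cuda(args.local_rank)
    model = DDP(model, device_ids=[args.local_rank])
    # ... DataLoader with DistributedSampler, optimizer, training loop ...

if __name__ == "__main__":
    main()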
- gloo backend works just fine
- nccl got stuck. I have tried the suggestions on the forum, but none of them worked.
Debug info:
sh start-dist-train.sh
NCCL_IB_DISABLE=true
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
nccl
nccl
torch-research-2gpu-0:48249:48249 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48249:48249 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
torch-research-2gpu-0:48249:48249 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
torch-research-2gpu-0:48249:48249 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48249:48249 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
torch-research-2gpu-0:48250:48250 [1] NCCL INFO Bootstrap : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48250:48250 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
torch-research-2gpu-0:48250:48250 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
torch-research-2gpu-0:48250:48250 [1] NCCL INFO NET/Socket : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48250:48250 [1] NCCL INFO Using network Socket