Single-node, 2-GPU distributed training with the NCCL backend hangs

I tried to train MNIST with torch.distributed.launch using the NCCL backend.

The launch command:

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=true  # results are the same with or without this
echo "NCCL_IB_DISABLE=$NCCL_IB_DISABLE"
export NCCL_SOCKET_IFNAME=eno1,eth0  # results are the same with or without this

python3 -m torch.distributed.launch --nproc_per_node 2 \
                                   --nnodes 1 \
                                   --node_rank 0 \
                                   --master_addr="0.0.0.0" \
                                   --master_port=2333 \
                                   main.py \
                                   --epochs 3 \
                                   --lr 1e-3 \
                                   --batch_size 150
  1. The Gloo backend works just fine.
  2. NCCL gets stuck.
  3. I have tried the suggestions on the forum, but none of them worked.
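
For reference, main.py follows the usual DDP pattern that torch.distributed.launch expects. Here is a minimal sketch of the per-process setup (simplified and illustrative, not my actual script; the model and data loading are placeholders):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--batch_size", type=int, default=150)
args = parser.parse_args()

print("nccl")  # backend name, printed once per process (matches the two bare "nccl" lines below)
torch.cuda.set_device(args.local_rank)  # bind this process to its GPU before creating the group
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(28 * 28, 10).cuda(args.local_rank)  # placeholder model
model = DDP(model, device_ids=[args.local_rank])
# ... MNIST DataLoader with DistributedSampler, then the usual training loop ...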

Debug info:

sh start-dist-train.sh
NCCL_IB_DISABLE=true
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
nccl
nccl
torch-research-2gpu-0:48249:48249 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48249:48249 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

torch-research-2gpu-0:48249:48249 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
torch-research-2gpu-0:48249:48249 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48249:48249 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
torch-research-2gpu-0:48250:48250 [1] NCCL INFO Bootstrap : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48250:48250 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

torch-research-2gpu-0:48250:48250 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
torch-research-2gpu-0:48250:48250 [1] NCCL INFO NET/Socket : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48250:48250 [1] NCCL INFO Using network Socket

Hey Chenchao,

A couple of questions:

  • Which PyTorch version are you using?
  • What is your OS/GPU setup?
  • Is it possible to share your script?
  • Is this behavior reproducible at every run?
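
Also, while we narrow this down, could you try a bare NCCL smoke test, launched with exactly the same command as main.py? If the all_reduce below also hangs, the problem is in the NCCL/socket setup rather than in your training code. (This is a generic snippet, assuming the default --local_rank plumbing of torch.distributed.launch; the file name smoke_test.py is just an example.)

# smoke_test.py -- run with the same torch.distributed.launch command as main.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# If this collective completes, NCCL itself is working; the result should equal the world size.
x = torch.ones(1, device=f"cuda:{args.local_rank}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce -> {x.item()}")
dist.destroy_process_group()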

Hey Can,

  • PyTorch version: 1.8.1+cu102
  • The instance is a Kubeflow notebook server
  • The container image is ubuntu:20.04
  • The behavior is reproducible at every run

I fixed the issue by setting the master IP to localhost and pointing NCCL at the loopback interface:
export NCCL_SOCKET_IFNAME=lo
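
To confirm the override actually reaches every worker, a quick generic check (not from my actual script) is to print the relevant environment at the top of main.py:

import os

# Confirm each worker sees the loopback override and the expected rendezvous settings.
for var in ("MASTER_ADDR", "MASTER_PORT", "NCCL_SOCKET_IFNAME", "NCCL_IB_DISABLE"):
    print(var, "=", os.environ.get(var))

With NCCL_DEBUG=INFO still set, the NCCL INFO Bootstrap line should now report the lo interface instead of eth0.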

I haven’t tried multi-node training with several GPUs per node yet.

Cool, glad that you could fix the problem. Let us know if you experience any issues with a multi-node setup.
