Single-node, 2-GPU distributed training with the NCCL backend hangs

I tried to train MNIST with torch.distributed.launch using the NCCL backend.

The launch command:

export NCCL_DEBUG=INFO
export NCCL_IB_DISABLE=true  # results are the same with or without this
echo "NCCL_IB_DISABLE=$NCCL_IB_DISABLE"
export NCCL_SOCKET_IFNAME=eno1,eth0  # results are the same with or without this

python3 -m torch.distributed.launch --nproc_per_node 2 \
                                   --nnodes 1 \
                                   --node_rank 0 \
                                   --master_addr="0.0.0.0" \
                                   --master_port=2333 \
                                   main.py \
                                   --epochs 3 \
                                   --lr 1e-3 \
                                   --batch_size 150
  1. The Gloo backend works just fine.
  2. NCCL gets stuck.
  3. I have tried the suggestions on the forum, but none of them worked.
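
For reference, main.py follows the usual DDP pattern that torch.distributed.launch expects. Here is a minimal sketch of the per-process setup (simplified and illustrative, not my actual script; the model and data loading are placeholders):

import argparse
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
parser.add_argument("--epochs", type=int, default=3)
parser.add_argument("--lr", type=float, default=1e-3)
parser.add_argument("--batch_size", type=int, default=150)
args = parser.parse_args()

print("nccl")  # backend name, printed once per process (matches the two bare "nccl" lines below)
torch.cuda.set_device(args.local_rank)  # bind this process to its GPU before creating the group
dist.init_process_group(backend="nccl", init_method="env://")

model = torch.nn.Linear(28 * 28, 10).cuda(args.local_rank)  # placeholder model
model = DDP(model, device_ids=[args.local_rank])
# ... MNIST DataLoader with DistributedSampler, then the usual training loop ...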

Debug info:

sh start-dist-train.sh
NCCL_IB_DISABLE=true
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
*****************************************
nccl
nccl
torch-research-2gpu-0:48249:48249 [0] NCCL INFO Bootstrap : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48249:48249 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

torch-research-2gpu-0:48249:48249 [0] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
torch-research-2gpu-0:48249:48249 [0] NCCL INFO NET/Socket : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48249:48249 [0] NCCL INFO Using network Socket
NCCL version 2.7.8+cuda10.2
torch-research-2gpu-0:48250:48250 [1] NCCL INFO Bootstrap : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48250:48250 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation

torch-research-2gpu-0:48250:48250 [1] misc/ibvwrap.cc:63 NCCL WARN Failed to open libibverbs.so[.1]
torch-research-2gpu-0:48250:48250 [1] NCCL INFO NET/Socket : Using [0]eth0:10.244.26.37<0>
torch-research-2gpu-0:48250:48250 [1] NCCL INFO Using network Socket

Hey Chenchao,

A couple of questions:

  • Which PyTorch version are you using?
  • What is your OS/GPU setup?
  • Is it possible to share your script?
  • Is this behavior reproducible at every run?
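
Also, while we narrow this down, could you try a bare NCCL smoke test, launched with exactly the same command as main.py? If the all_reduce below also hangs, the problem is in the NCCL/socket setup rather than in your training code. (This is a generic snippet, assuming the default --local_rank plumbing of torch.distributed.launch; the file name smoke_test.py is just an example.)

# smoke_test.py -- run with the same torch.distributed.launch command as main.py
import argparse
import torch
import torch.distributed as dist

parser = argparse.ArgumentParser()
parser.add_argument("--local_rank", type=int, default=0)  # injected by torch.distributed.launch
args = parser.parse_args()

torch.cuda.set_device(args.local_rank)
dist.init_process_group(backend="nccl", init_method="env://")

# If this collective completes, NCCL itself is working; the result should equal the world size.
x = torch.ones(1, device=f"cuda:{args.local_rank}")
dist.all_reduce(x)
print(f"rank {dist.get_rank()}: all_reduce -> {x.item()}")
dist.destroy_process_group()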

Hey Can,

  • PyTorch version: 1.8.1+cu102
  • The instance is a Kubeflow notebook server
  • The container image is ubuntu:20.04
  • The behavior is reproducible at every run

I fixed the issue by setting the master IP to localhost and pointing NCCL at the loopback interface:
export NCCL_SOCKET_IFNAME=lo
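
To confirm the override actually reaches every worker, a quick generic check (not from my actual script) is to print the relevant environment at the top of main.py:

import os

# Confirm each worker sees the loopback override and the expected rendezvous settings.
for var in ("MASTER_ADDR", "MASTER_PORT", "NCCL_SOCKET_IFNAME", "NCCL_IB_DISABLE"):
    print(var, "=", os.environ.get(var))

With NCCL_DEBUG=INFO still set, the NCCL INFO Bootstrap line should now report the lo interface instead of eth0.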

I haven’t tried multi-node training with several GPUs per node yet.

Cool, glad that you could fix the problem. Let us know if you experience any issues with a multi-node setup.
