Stuck in clip_grad_norm_ and optimizer.step() when using multiple GPUs

Hi, I've run into a problem when training my model on multiple GPUs on a single node with nn.parallel.DistributedDataParallel. The PyTorch version is 0.4.1.
I use the following command to run the program:

export NGPU=3;
python -m torch.distributed.launch --nproc_per_node=$NGPU train.py

The program gets stuck in clip_grad_norm_() and optimizer.step(). When I check GPU utilization with nvidia-smi, all three GPUs sit at 100%.
However, when I set NGPU=1, the program runs correctly.
What’s the reason? How can I fix it? Thanks!