Distributed data parallel freezes without error message

I'm using pytorch-nightly 1.7 and NCCL 2.7.6, but the problem still exists. I cannot run distributed training.

Could you help us reproduce this issue? Maybe try to run a part of your code on Google Colab and share the link if you face the same problem there.

OK, I will try to use Colab to reproduce this issue.

I'm on NCCL 2.7.8 and also seeing this issue. The model I am training is hopelessly complex (a GAN with 3 networks and 10 losses), so it's going to be quite a bit of work to pare it down to the point where I can share a reproduction snippet here. For now, just adding some additional signal to this topic.

Edit: Also, in my case the GPU hangs at 100% utilization. It is actually very reproducible for me, so if there is some debugging I can do, please let me know.

This could indicate a dead process, while the other processes are waiting for it and thus spinning the wait kernels at full utilization.
You could check whether a process was terminated via ps auxf, and you might also get more information by killing the current parent process and checking the stack trace.
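If you can modify the script, a small sketch of a debugging aid (not required, just an assumption about your setup) would be to register a faulthandler hook so you can dump the Python stack traces of the stuck rank from another terminal:

```python
# Hypothetical debugging helper: print the Python stack trace of every
# thread when the process receives SIGUSR1 (e.g. via `kill -USR1 <pid>`).
import faulthandler
import signal

faulthandler.register(signal.SIGUSR1, all_threads=True)

# Optionally also dump the tracebacks automatically if the process makes
# no progress for 10 minutes.
faulthandler.dump_traceback_later(timeout=600, repeat=True)
```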

@ptrblck

Is there any update on this freeze without an error message?

I tried to follow the example here
(https://pytorch.org/tutorials/beginner/blitz/data_parallel_tutorial.html), but I always get stuck in the "forward" function. There are no errors, but the whole process seems to stop.

My system spec is Ubuntu 18.04, CUDA 10.1, PyTorch 1.7.0, and I have 2 GPUs.
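For reference, a stripped-down sketch of what I'm running (placeholder model and sizes, not my actual code) looks like this:

```python
# Stripped-down sketch of the setup that hangs (placeholder model and sizes).
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)  # replicate the model over both GPUs

model = model.cuda()
x = torch.randn(8, 10).cuda()

out = model(x)  # <- the script never returns from this forward call
print(out.shape)
```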

There is no general root cause for a freeze to the best of my knowledge, and there wasn't a follow-up on my last post, so I'm unsure whether the issue was isolated or not.

Generally, I would recommend trying to scale the issue down, e.g. by using a single worker in the DataLoader, running the script in a terminal, etc.
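As a rough sketch of what I mean by scaling down (synthetic data, no worker processes, so the DataLoader can be ruled out):

```python
# Sketch: tiny synthetic dataset loaded in the main process (num_workers=0),
# to rule out DataLoader worker processes as the source of the hang.
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(16, 10), torch.randint(0, 2, (16,)))
loader = DataLoader(dataset, batch_size=4, num_workers=0)

for x, y in loader:
    print(x.shape, y.shape)
```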

I found the answer!

Modify /etc/default/grub:

#GRUB_CMDLINE_LINUX=""                           <---- original (commented out)
GRUB_CMDLINE_LINUX="iommu=soft"                  <---- change to this

Then run sudo update-grub and reboot.

ref : https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158


This resolves the problem for me:

export NCCL_P2P_DISABLE=1
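If you would rather set it from inside the script, something like this should be equivalent, as long as it runs before any NCCL communicator is created (a sketch, not tested on every setup):

```python
# Disable NCCL peer-to-peer transfers from the training script itself.
# This must run before torch.distributed.init_process_group("nccl")
# initializes any NCCL communicators.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"
```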

Thanks, really helpful.

When I was stuck with an A6000 GPU and PyTorch 1.10 with apex 0.1, this answer helped me get rid of the halting caused by apex.parallel.DistributedDataParallel.
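In case it helps anyone else hitting the same hang with apex: as far as I know, the built-in DDP now covers what apex.parallel.DistributedDataParallel provided, so switching to it is another option. A rough sketch (placeholder model, assuming a torchrun launch):

```python
# Rough sketch: native DDP as a replacement for apex.parallel.DistributedDataParallel.
# Launch with: torchrun --nproc_per_node=2 script.py
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
torch.cuda.set_device(local_rank)

model = nn.Linear(10, 1).cuda(local_rank)   # placeholder model
model = DDP(model, device_ids=[local_rank])

out = model(torch.randn(4, 10).cuda(local_rank))
print(f"rank {dist.get_rank()}: {out.shape}")

dist.destroy_process_group()
```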

Many thanks. This solves my problem on an RTX A6000 as well.

This resolved my error too (PyTorch 1.12.1).