Distributed data parallel freezes without error message

I'm using pytorch-nightly 1.7 and NCCL 2.7.6, but the problem still exists. I cannot run distributed training.

Could you help us reproduce this issue? Maybe try to run a part of your code on Google Colab and share the link if you face the same problem there.

OK, I will try to reproduce this issue on Colab.

I’m on NCCL 2.7.8 and also seeing this issue. The model I am training is hopelessly complex (a GAN with 3 networks and 10 losses) so it’s going to be quite a bit of work to pare it down to the point where I can share a reproduction snippet here. For now just adding some additional signal to this topic.

Edit: also in my case, the GPU hangs at 100% utilization. It is very reproducible for me, so if there is some debugging I can do, please let me know.

This could indicate a dead process, while the other processes are waiting for it and thus spinning the wait kernels at full utilization.
You could check whether a process was terminated via ps auxf, and you might also get more information by killing the current parent process and checking the stack trace.
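If attaching a debugger is not convenient, a small sketch like the one below (using Python's standard faulthandler module; the choice of SIGUSR1 is just an assumption) lets each rank dump its Python stack traces on demand, which pairs well with ps auxf for spotting which rank died and where the surviving ranks are blocked:

```python
import faulthandler
import signal
import sys

# Dump the Python stack traces of all threads in this process when it
# receives SIGUSR1, e.g. via: kill -USR1 <pid>
faulthandler.register(signal.SIGUSR1, file=sys.stderr, all_threads=True)
```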

@ptrblck

Is there any update on this freeze without an error message?

I tried to follow the example in the tutorial (Optional: Data Parallelism — PyTorch Tutorials 2.2.0+cu121 documentation), but I always get stuck in the “forward” function. There are no errors, but the whole process seems to stop.

My system spec is Ubuntu 18.04, CUDA 10.1, PyTorch 1.7.0, and 2 GPUs.

There is no general root cause for a freeze to the best of my knowledge, and there wasn't a follow-up on my last post, so I'm unsure whether the issue was isolated or not.

Generally, I would recommend trying to scale down the issue, e.g. by using a single worker in the DataLoader, running the script in a terminal, etc.
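For reference, a minimal, scaled-down DDP script along these lines (the file name, sizes, and step count are placeholders, not taken from any post above) can help isolate whether the hang comes from the real model, the data loading, or the DDP/NCCL setup itself:

```python
import os

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Launched e.g. with: torchrun --nproc_per_node=2 repro.py
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Tiny model and random inputs: this removes the real model and the
    # DataLoader from the equation, so only the DDP/NCCL path is exercised.
    model = DDP(torch.nn.Linear(10, 10).cuda(local_rank), device_ids=[local_rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

    for step in range(100):
        inputs = torch.randn(8, 10, device=local_rank)
        loss = model(inputs).sum()
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if local_rank == 0 and step % 10 == 0:
            print(f"step {step} finished", flush=True)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

If this sketch runs cleanly while the full training script hangs, the problem is likely in the model or input pipeline rather than in the NCCL setup.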

I found the answer!

Modify /etc/default/grub:

#GRUB_CMDLINE_LINUX=""                    <-- original line, commented out
GRUB_CMDLINE_LINUX="iommu=soft"           <-- changed line

ref : https://github.com/pytorch/pytorch/issues/1637#issuecomment-338268158


This resolves the problem for me:

export NCCL_P2P_DISABLE=1
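If you want to confirm that peer-to-peer access is involved before disabling it globally, a quick sketch along these lines may help (just an illustration; the environment variable must be set before the NCCL communicator is created):

```python
import os

import torch

# Ask CUDA whether the two GPUs can access each other's memory directly.
print("P2P 0 -> 1:", torch.cuda.can_device_access_peer(0, 1))
print("P2P 1 -> 0:", torch.cuda.can_device_access_peer(1, 0))

# The workaround can also be applied from Python, as long as this runs
# before torch.distributed.init_process_group() creates the NCCL communicator.
os.environ["NCCL_P2P_DISABLE"] = "1"
```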

Thanks, really helpful.

When I was stuck with an A6000 GPU, PyTorch 1.10, and apex 0.1, this answer helped me get rid of the hang caused by apex.parallel.DistributedDataParallel.

Many thanks. This solves my problem on an RTX A6000 as well.

This resolved my error too. pytorch=1.12.1

Check whether this actually helps in your case; disabling P2P might not reduce latency and can make communication slower.

I am also facing a similar issue, but mine is quite concerning because none of the above fixes seem to work. I am using PyTorch 2.0 with CUDA 11.7 and NCCL 2.14.3. ACS is disabled and the NCCL tests all work perfectly fine. Even the cuda-samples tests like simpleP2P and simpleMultiGPU work fine. But when I launch my program, it hangs after 2 or 3 hours of training with no message whatsoever. How can I even debug that? The code works perfectly fine in a non-distributed setting.

Here is my topology:
        GPU0    GPU1    CPU Affinity    NUMA Affinity    GPU NUMA ID
GPU0    X       NV4     0-27            N/A              N/A
GPU1    NV4     X       0-27            N/A              N/A

And here is the output of my command when run with NCCL_DEBUG=INFO:

user:36979:36979 [0] NCCL INFO Bootstrap : Using eno1:192.168.151.5<0>
user:36979:36979 [0] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
user:36979:36979 [0] NCCL INFO cudaDriverVersion 12020
NCCL version 2.14.3+cuda11.7
user:36979:37087 [0] NCCL INFO NET/IB : No device found.
user:36979:37087 [0] NCCL INFO NET/Socket : Using [0]eno1:192.168.151.5<0> [1]virbr0:192.168.122.1<0>
user:36979:37087 [0] NCCL INFO Using network Socket
user:36980:36980 [1] NCCL INFO cudaDriverVersion 12020
user:36980:36980 [1] NCCL INFO Bootstrap : Using eno1:192.168.151.5<0>
user:36980:36980 [1] NCCL INFO NET/Plugin : No plugin found (libnccl-net.so), using internal implementation
user:36980:37091 [1] NCCL INFO NET/IB : No device found.
user:36980:37091 [1] NCCL INFO NET/Socket : Using [0]eno1:192.168.151.5<0> [1]virbr0:192.168.122.1<0>
user:36980:37091 [1] NCCL INFO Using network Socket
user:36979:37087 [0] NCCL INFO Setting affinity for GPU 0 to 0fffffff
user:36980:37091 [1] NCCL INFO Setting affinity for GPU 1 to 0fffffff
user:36979:37087 [0] NCCL INFO Channel 00/04 : 0 1
user:36979:37087 [0] NCCL INFO Channel 01/04 : 0 1
user:36979:37087 [0] NCCL INFO Channel 02/04 : 0 1
user:36980:37091 [1] NCCL INFO Trees [0] -1/-1/-1->1->0 [1] 0/-1/-1->1->-1 [2] -1/-1/-1->1->0 [3] 0/-1/-1->1->-1
user:36979:37087 [0] NCCL INFO Channel 03/04 : 0 1
user:36979:37087 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] -1/-1/-1->0->1 [2] 1/-1/-1->0->-1 [3] -1/-1/-1->0->1
user:36979:37087 [0] NCCL INFO Channel 00/0 : 0[a1000] -> 1[c1000] via P2P/IPC
user:36979:37087 [0] NCCL INFO Channel 01/0 : 0[a1000] -> 1[c1000] via P2P/IPC
user:36980:37091 [1] NCCL INFO Channel 00/0 : 1[c1000] -> 0[a1000] via P2P/IPC
user:36979:37087 [0] NCCL INFO Channel 02/0 : 0[a1000] -> 1[c1000] via P2P/IPC
user:36980:37091 [1] NCCL INFO Channel 01/0 : 1[c1000] -> 0[a1000] via P2P/IPC
user:36979:37087 [0] NCCL INFO Channel 03/0 : 0[a1000] -> 1[c1000] via P2P/IPC
user:36980:37091 [1] NCCL INFO Channel 02/0 : 1[c1000] -> 0[a1000] via P2P/IPC
user:36980:37091 [1] NCCL INFO Channel 03/0 : 1[c1000] -> 0[a1000] via P2P/IPC
user:36979:37087 [0] NCCL INFO Connected all rings
user:36979:37087 [0] NCCL INFO Connected all trees
user:36979:37087 [0] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
user:36979:37087 [0] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
user:36980:37091 [1] NCCL INFO Connected all rings
user:36980:37091 [1] NCCL INFO Connected all trees
user:36980:37091 [1] NCCL INFO threadThresholds 8/8/64 | 16/8/64 | 512 | 512
user:36980:37091 [1] NCCL INFO 4 coll channels, 4 p2p channels, 4 p2p channels per peer
user:36979:37087 [0] NCCL INFO comm 0x4aa53cf0 rank 0 nranks 2 cudaDev 0 busId a1000 - Init COMPLETE
user:36980:37091 [1] NCCL INFO comm 0x6ade3ee0 rank 1 nranks 2 cudaDev 1 busId c1000 - Init COMPLETE

Any help is appreciated!

You could attach gdb to the hanging processes and check their stack traces to narrow down the issue further, assuming you've already used the logging env variables such as TORCH_DISTRIBUTED_DEBUG and TORCH_CPP_LOG_LEVEL and didn't see anything concerning.
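For example, a silent hang can often be turned into an actionable error by enabling the debug logs and shortening the collective timeout. This is only a sketch; the values are illustrative, and the environment variables are normally exported in the shell before launching:

```python
import os
from datetime import timedelta

# Normally exported in the shell before launching; setting them here only
# works because it happens before torch is imported.
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "DETAIL"
os.environ["NCCL_DEBUG"] = "INFO"

import torch.distributed as dist

# With a shorter timeout, a stuck collective fails after 10 minutes instead
# of hanging silently (the default is 30 minutes). Depending on the PyTorch
# version, NCCL_ASYNC_ERROR_HANDLING=1 may also be required for the timeout
# to actually abort the process.
dist.init_process_group(backend="nccl", timeout=timedelta(minutes=10))
```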