Problem about FSDP training. How to select the CUDA toolkit version of nvidia-nccl-cu12?

Hello, while running distributed training with FSDP, I encounter the following error between saving the weights and loading the saved checkpoint:

torch.distributed.DistBackendError: NCCL error: …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1331
From my research, I suspect there is a high likelihood of an incompatibility between torch and the CUDA toolkit version of nvidia-nccl-cu. I also noticed that installing torch from pytorch.org automatically installs an nvidia-nccl-cu package compatible with CUDA toolkit 12.3.

Additionally, my computing environment is as follows:
DGX H100, 1 node, NVIDIA driver 535.161.07

Is there a way to install packages like nvidia-nccl-cu for torch versions 2.1.0 and above that avoids the above problem?

Installing torch from pytorch.org will also install its packaged CUDA and NCCL.

Does it work if you uninstall your custom installation of nvidia-nccl-cu, and then install torch from pytorch.org?
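
For example, something along these lines (a sketch; adjust the package names and CUDA tag to your setup):

pip uninstall -y nvidia-nccl-cu12 nvidia-nccl-cu11        # remove the separately installed NCCL wheels
pip uninstall -y torch torchvision torchaudio
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118   # the wheel pulls in its own matching NCCL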

Thank you for your reply

I installed it using the following command. The problem occurred with the CUDA and NCCL packages installed here.

pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118

The issue is that the version of nvidia-nccl-cu installed by the above command is not compatible with the CUDA toolkit that is installed alongside torch. More information can be found at the following website:

Specifically, the version of nvidia-nccl-cu installed by pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118 is 2.19.3, which is not compatible with CUDA 11.8.

Is it possible to obtain a torch version greater than 2.1.0 from pytorch.org without running into the above problem?

That’s not the case: the CUDA version tagged in the NCCL binaries is simply the CUDA version used to build them, while the CUDA runtime is statically linked into the binary, as is the common approach.
I would thus be interested in which part of the docs indicates the incompatibility you are claiming.
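
As a quick check, you can print the CUDA and NCCL versions your PyTorch build actually uses, e.g. (a minimal sketch):

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.nccl.version())"   # torch version, its CUDA toolkit, and the statically linked NCCL version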

This issue occurred when installing certain versions of PyTorch (2.2.0 or higher). When I installed version 2.1.2 or lower from pytorch.org, it did not install anything related to CUDA or NCCL (like nvidia-nccl-cu, nvidia-cudnn, etc.), which resolved the problem.
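
For reference, the CUDA/NCCL-related wheels pulled in by each torch install can be listed with something like (a sketch):

pip list | grep -Ei 'torch|nvidia'   # shows which nvidia-nccl-cu*, nvidia-cudnn-cu*, etc. packages are present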

ptrblck was correct; my understanding of the CUDA version for NCCL was inaccurate. Therefore, the discrepancy between the CUDA version of NCCL and the CUDA version used by PyTorch, which I mentioned, does not seem to be a critical issue.

Ultimately, I am still unable to pinpoint the direct cause of the problem…
Is this a problem with torch 2.2.0 or higher?

Could you rerun your workload with the debug env variables mentioned here as well as NCCL_DEBUG=INFO and post the logs?
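
For example, something along these lines before the launch (a sketch; the launch command and log file path are assumptions, and the linked post may list additional variables):

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL                   # optional: more verbose NCCL subsystem logging
export NCCL_DEBUG_FILE=/tmp/nccl_debug.%h.%p   # per-process NCCL log files (%h = host, %p = pid)
export TORCH_DISTRIBUTED_DEBUG=DETAIL          # extra consistency checks from torch.distributed
export TORCH_SHOW_CPP_STACKTRACES=1
accelerate launch mllm/pipeline/finetune.py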

Sure, please be patient as it will take some time to reproduce the error.

@ptrblck

DGXH100:2063:2063 [7] NCCL INFO cudaDriverVersion 12020
DGXH100:2063:2063 [7] NCCL INFO Bootstrap : Using eth0:172.28.0.2<0>
DGXH100:2063:2063 [7] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
DGXH100:2063:2063 [7] NCCL INFO init.cc:1627 Cuda Host Alloc Size 4 pointer 0x7fecd9e00000
DGXH100:2063:5450 [7] NCCL INFO Failed to open libibverbs.so[.1]
DGXH100:2063:5450 [7] NCCL INFO NET/Socket : Using [0]eth0:172.28.0.2<0>
DGXH100:2063:5450 [7] NCCL INFO Using non-device net plugin version 0
DGXH100:2063:5450 [7] NCCL INFO Using network Socket
DGXH100:2058:4232 [2] NCCL INFO comm 0x563ffa9eec60 rank 2 nrDGXH100:2056:4228 [0] NCCL INFO comm 0x55a7e8ff8cb0 rank 0 nranks 8 cudaDev 0 DGXH100:2058:4232 [2] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 ‘eth0’
DGXH100:2058:4232 [2] NCCL INFO === System : maxBw 360.0 totalBw 360.0 ===
DGXH100:2058:4232 [2] NCCL INFO CPU/0 (1/1/2)
DGXH100:2058:4232 [2] NCCL INFO + PCI[5000.0] - NIC/0
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/16000 (15b3197900000000)
DGXH100:2058:4232 [2] NCCL INFO + PCI[48.0] - PCI/19000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - GPU/1B000 (0)
DGXH100:2058:4232 [2] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2058:4232 [2] NCCL INFO + PCI[48.0] - PCI/3E000 (15b3197900000000)
DGXH100:2058:4232 [2] NCCL INFO + PCI[48.0] - PCI/41000 (15b3197900000000)
DGXH100:2058:4232 [2] NCCL INFO + PCI[48.0] - GPU/43000 (1)
DGXH100:2058:4232 [2] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2058:4232 [2] NCCL INFO + PCI[48.0] - PCI/4D000 (15b3197900000000)
DGXH100:2058:4232 [2] NCCL INFO + PCI[48.0] - PCI/50000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - GPU/52000 (2)
DGXH100:2060:4993 [4] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/5C000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/5F000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - GPU/61000 (3)
DGXH100:2057:4230 [1] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2057:4230 [1] NCCL INFO + SYS[10.0] - CPU/1
DGXH100:2057:4230 [1] NCCL INFO CPU/1 (1/1/2)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/98000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/9B000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - GPU/9D000 (4)
DGXH100:2060:4993 [4] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/BE000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/C1000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - GPU/C3000 (5)
DGXH100:2060:4993 [4] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/CC000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/CF000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - GPU/D1000 (6)
DGXH100:2060:4993 [4] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/DA000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - PCI/DD000 (15b3197900000000)
DGXH100:2060:4993 [4] NCCL INFO + PCI[48.0] - GPU/DF000 (7)
DGXH100:2060:4993 [4] NCCL INFO + NVL[360.0] - NVS/0
DGXH100:2060:4993 [4] NCCL INFO + SYS[10.0] - CPU/0
DGXH100:2060:4993 [4] NCCL INFO ==========================================
DGXH100:2060:4993 [4] NCCL INFO GPU/1B000 :GPU/1B000 (0/5000.000000/LOC) GPU/43000 (2/360.000000/NVL) GPU/52000 (2/360.000000/NVL) GPU/61000 (2/360.000000/NVL) GPU/9D000 (2/360.000000/NVL) GPU/C3000 (2/360.000000/NVL) GPU/D1000 (2/360.000000/NVL) GPU/DF000 (2/360.000000/NVL) NVS/0 (1/360.000000/NVL) CPU/0 (3/48.000000/PHB) CPU/1 (4/10.000000/SYS)
DGXH100:2060:4993 [4] NCCL INFO GPU/43000 :GPU/1B000 (2/360.000000/NVL) GPU/43000 (0/5000.000000/LOC) GPU/52000 (2/360.000000/NVL) GPU/61000 (2/360.000000/NVL) GPU/9D000 (2/360.000000/NVL) GPU/C3000 (2/360.000000/NVL) GPU/D1000 (2/360.000000/NVL) GPU/DF000 (2/360.000000/NVL) NVS/0 (1/360.000000/NVL) CPU/0 (3/48.000000/PHB) CPU/1 (4/10.000000/SYS)
DGXH100:2057:4230 [1] NCCL INFO GPU/52000 :GPU/1B000 (2/360.000000/NVL) GPU/43000 (2/360.000000/NVL) GPU/52000 (0/5000.000000/LOC) GPU/61000 (2/360.000000/NVL) GPU/9D000 (2/360.000000/NVL) GPU/C3000 (2/360.000000/NVL) GPU/D1000 (2/360.000000/NVL) GPU/DF000 (2/360.000000/NVL) NVS/0 (1/360.000000/NVL) CPU/0 (3/48.000000/PHB) CPU/1 (4/10.000000/SYS)
DGXH100:2060:4993 [4] NCCL INFO GPU/61000 :GPU/1B000 (2/360.000000/NVL) GPU/43000 (2/360.000000/NVL) GPU/52000 (2/360.000000/NVL) GPU/61000 (0/5000.000000/LOC) GPU/9D000 (2/360.000000/NVL) GPU/C3000 (2/360.000000/NVL) GPU/D1000 (2/360.000000/NVL) GPU/DF000 (2/360.000000/NVL) NVS/0 (1/360.000000/NVL) CPU/0 (3/48.000000/PHB) CPU/1 (4/10.000000/SYS)
DGXH100:2060:4993 [4] NCCL INFO GPU/9D000 :GPU/1B000 (2/360.000000/NVL) GPU/43000 (2/360.000000/NVL) GPU/52000 (2/360.000000/NVL) GPU/61000 (2/360.000000/NVL) GPU/9D000 (0/5000.000000/LOC) GPU/C3000 (2/360.000000/NVL) GPU/D1000 (2/360.000000/NVL) GPU/DF000 (2/360.000000/NVL) NVS/0 (1/360.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/48.000000/PHB)
DGXH100:2057:4230 [1] NCCL INFO GPU/C3000 :GPU/1B000 (2/360.000000/NVL) GPU/43000 (2/360.000000/NVL) GPU/52000 (2/360.000000/NVL) GPU/61000 (2/360.000000/NVL) GPU/9D000 (2/360.000000/NVL) GPU/C3000 (0/5000.000000/LOC) GPU/D1000 (2/360.000000/NVL) GPU/DF000 (2/360.000000/NVL) NVS/0 (1/360.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/48.000000/PHB)
DGXH100:2057:4230 [1] NCCL INFO GPU/D1000 :GPU/1B000 (2/360.000000/NVL) GPU/43000 (2/360.000000/NVL) GPU/52000 (2/360.000000/NVL) GPU/61000 (2/360.000000/NVL) GPU/9D000 (2/360.000000/NVL) GPU/C3000 (2/360.000000/NVL) GPU/D1000 (0/5000.000000/LOC) GPU/DF000 (2/360.000000/NVL) NVS/0 (1/360.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/48.000000/PHB)
DGXH100:2057:4230 [1] NCCL INFO GPU/DF000 :GPU/1B000 (2/360.000000/NVL) GPU/43000 (2/360.000000/NVL) GPU/52000 (2/360.000000/NVL) GPU/61000 (2/360.000000/NVL) GPU/9D000 (2/360.000000/NVL) GPU/C3000 (2/360.000000/NVL) GPU/D1000 (2/360.000000/NVL) GPU/DF000 (0/5000.000000/LOC) NVS/0 (1/360.000000/NVL) CPU/0 (4/10.000000/SYS) CPU/1 (3/48.000000/PHB)
DGXH100:2057:4230 [1] NCCL INFO Setting affinity for GPU 1 to ff,ffffffff,ffff0000,00000000,00ffffff,ffffffff
DGXH100:2057:4230 [1] NCCL INFO NVLS multicast support is available on dev 1
DGXH100:2057:4230 [1] NCCL INFO Pattern 4, crossNic 0, nChannels 12, bw 30.000000/30.000000, type NVL/PIX, sameChannels 1
DGXH100:2057:4230 [1] NCCL INFO 0 : GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7
DGXH100:2057:42DGXH100:2063:5450 [7] NCCL INFO 1 : GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPUDGXH100:2057:4230 [1] NCCL INFO 2 : GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7
DGXH100:2057:4230 [1] NCCL INFO 3 : GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7
DGXH100:2057:4230 [1] NCCL INFO 4 : GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7
DGXH100:2057:4230 [1] NCCL INFO 5 : GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7
DGXH100:2057:4230 [1] NCCL INFO 6 : GPU/0 GPU/1 GPU/2 GPU/3 GPU/4 GPU/5 GPU/6 GPU/7
ecaca277b0
DGXH100:2063:5450 [7] NCCL INFO Channel 19/0 : 7[7] → 0[0] via P2P/CUMEM
DGXH100:2063:5595 [7] NCCL INFO New proxy send connection 43 from local rank 7, transport 0
DGXH100:2063:5595 [7] NCCL INFO proxyProgressAsync opId=0x7fecaca277b0 op.type=1 op.reqBuff=0x7feca40465f0 op.respSize=16 done
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x7fecaca277b0
DGXH100:2063:5595 [7] NCCL INFO Received and initiated operation=Init res=0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x7fecaca277b0 matches expected opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO Connected to proxy localRank 7 → connection 0x7feca4006348
DGXH100:2063:5595 [7] NCCL INFO Allocated shareable buffer 0x7fedf3c00000 size 2097152 ipcDesc 0x7feca4047d40
DGXH100:2063:5595 [7] NCCL INFO proxyProgressAsync opId=0x7fecaca277b0 op.type=3 op.reqBuff=0x7feca4047d10 op.respSize=80 done
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x7fecaca277b0
DGXH100:2063:5595 [7] NCCL INFO Received and initiated operation=Setup res=0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x7fecaca277b0 matches expected opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO Channel 20/0 : 7[7] → 0[0] via P2P/CUMEM
DGXH100:2063:5595 [7] NCCL INFO New proxy send connection 44 from local rank 7, transport 0
DGXH100:2063:5595 [7] NCCL INFO proxyProgressAsync opId=0x7fecaca277b0 op.type=1 op.reqBuff=0x7feca4047d10 op.respSize=16 done
DGXH100:2063:5595 [7] NCCL INFO Received and initiated operation=Init res=0
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x7fecaca277b0 matches expected opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO Connected to proxy localRank 7 → connection 0x7feca40063c0
DGXH100:2063:5595 [7] NCCL INFO Allocated shareable buffer 0x7fedf3e00000 size 2097152 ipcDesc 0x7feca4049460
DGXH100:2063:5595 [7] NCCL INFO proxyProgressAsync opId=0x7fecaca277b0 op.type=3 op.reqBuff=0x7feca4049430 op.respSize=80 done
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x7fecaca277b0
DGXH100:2063:5595 [7] NCCL INFO Received and initiated operation=Setup res=0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x7fecaca277b0 matches expected opId=0x7fecaca277b0


DGXH100:2063:5450 [7] NCCL INFO NVLS comm 0x556493aa0df0 headRank 7 nHeads 8 buffSize 4194304 memSize 2097152 nvlsPerRankSize 201326592 nvlsTotalSize 1610612736
DGXH100:2063:5450 [7] NCCL INFO NVLS importing shareableHandle 0x7fecac546c18 from rank 0
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x7fecaca277b0 matches expected opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO Connected to proxy localRank 0 → connection 0x7f425c074d58
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x706d64e6695114d0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x706d64e6695114d0 matches expected opId=0x706d64e6695114d0
DGXH100:2063:5450 [7] NCCL INFO NVLS group 7fecac6a3980 adding dev 7
DGXH100:2063:5450 [7] NCCL INFO NVLS Mapped UC at 0xa20000000 size 1610612736
DGXH100:2063:5450 [7] NCCL INFO NVLS Bind mem 0xa20000000 UC handle 0x7fecac6a4170 MC handle 0x7fecac6a3980 size 1610612736
DGXH100:2063:5450 [7] NCCL INFO NVLS Mapped MC buffer at 0xa80000000 size 1610612736
DGXH100:2063:5450 [7] NCCL INFO Connected NVLS tree
DGXH100:2063:5450 [7] NCCL INFO threadThresholds 8/8/64 | 64/8/64 | 512 | 512
DGXH100:2063:5450 [7] NCCL INFO 24 coll channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
DGXH100:2063:5595 [7] NCCL INFO Allocated 4194660 bytes of shared memory in /dev/shm/nccl-onguMb
DGXH100:2063:5595 [7] NCCL INFO New proxy send connection 144 from local rank 7, transport 2
DGXH100:2063:5595 [7] NCCL INFO proxyProgressAsync opId=0x7fecaca277b0 op.type=1 op.reqBuff=0x7feca406fbd0 op.respSize=16 done
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x7fecaca277b0
DGXH100:2063:5595 [7] NCCL INFO Received and initiated operation=Init res=0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x7fecaca277b0 matches expected opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO Connected to proxy localRank 7 → connection 0x7feca4071a90
DGXH100:2063:5595 [7] NCCL INFO Allocated shareable buffer 0xa02000000 size 268435456 ipcDesc 0x7feca4075538
DGXH100:2063:5595 [7] NCCL INFO proxyProgressAsync opId=0x7fecaca277b0 op.type=2 op.reqBuff=0x7feca40712f0 op.respSize=0 done
DGXH100:2063:5595 [7] NCCL INFO Received and initiated operation=SharedInit res=0
DGXH100:2063:5450 [7] NCCL INFO ncclPollProxyResponse Received new opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO resp.opId=0x7fecaca277b0 matches expected opId=0x7fecaca277b0
DGXH100:2063:5450 [7] NCCL INFO init.cc:415 Cuda Alloc Size 7744 pointer 0x7fe82de00000
DGXH100:2063:5450 [7] NCCL INFO init.cc:441 Cuda Host Alloc Size 33554432 pointer 0x7feeb8000000
DGXH100:2063:5450 [7] NCCL INFO init.cc:447 Cuda Host Alloc Size 128 pointer 0x7fecd9ec0200
DGXH100:2063:5450 [7] NCCL INFO Tuner: plugin load ‘(null)’ returned error (11 : (null)), using default tuner instead.
DGXH100:2063:5450 [7] NCCL INFO comm 0x556493aa0df0 rank 7 nranks 8 cudaDev 7 nvmlDev 7 busId df000 commId 0xba6b5f888d5d7009 - Init COMPLETE
DGXH100:2063:2063 [7] NCCL INFO AllGather: opCount 0 sendbuff 0x7feeac000000 recvbuff 0x7fecba000000 count 66592768 datatype 0 op 0 root 0 comm 0x556493aa0df0 [nranks=8] stream 0x5564901f9a20
DGXH100:2063:2063 [7] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
DGXH100:2063:2063 [7] NCCL INFO AllGather: opCount 1 sendbuff 0x7feeac000000 recvbuff 0x7fec38000000 count 50595840 datatype 0 op 0 root 0 comm 0x556493aa0df0 [nranks=8] stream 0x5564901f9a20
DGXH100:2063:2063 [7] NCCL INFO AllGather: opCount 2 sendbuff 0x7feeac000000 recvbuff 0x7fe8d2000000 count 50595840 datatype 0 op 0 root 0 comm 0x556493aa0df0 [nranks=8] stream 0x5564901f9a20
DGXH100:2063:2063 [7] NCCL INFO AllGather: opCount 3 sendbuff 0x7fecf2e00000 recvbuff 0x7fe8b8000000 count 50595840 datatype 0 op 0 root 0 comm 0x556493aa0df0 [nranks=8] stream 0x5564901f9a20


DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 192 sendbuff 0x7fd488000000 recvbuff 0x7fd1e858b000 count 133185536 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 193 sendDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 193 sendbuff 0x7f2878000000 recvbuff 0x7f1a93820000 count 133185536 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 194 sDGXDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 194 sendbuff 0x7f2613f04000 recvbuff 0x7f25d060c000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 195 sDGXH100:2059:2059 [3] NDGXH100:2060:2060 [4] NCCL INFO AllGather: opCount 195 sendbuff 0x7fa362a00000 recvbuff 0x7f946a000000 count 1011916DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 196 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 196 sendbuff 0x7fe934000000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 197 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 197 sendbuff 0x7fea74000000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 198 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 198 sendbuff 0x7fea12a81000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 199 sDGXH100:2059DGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 199 sendbuff 0x7f2b00000000 recvbuff 0x7f1e68408000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 19a sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 19a sendbuff 0x7fea7a081000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 19b sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 19b sendbuff 0x7fea8c204000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 19c sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 19c sendbuff 0x7fea86183000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 19d sendbuff 0x7fDGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 19d sendbuff 0x7f2b18204000 recvbuff 0x7f1e68408000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 19e sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 19e sendbuff 0x7fe946000000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 19f sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 19f sendbuff 0x7fe958183000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a0 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1a0 sendbuff 0x7fe952102000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a1 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1a1 sendbuff 0x7fe50c000000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a2 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1a2 sendbuff 0x7fe95e204000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a3 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1a3 sendbuff 0x7fe518102000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a4 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1a4 sendbuff 0x7fe512081000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: 
opCount 1a5 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1a5 sendbuff 0x7fe524204000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a6 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1a6 sendbuff 0x7fe51e183000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a7 sendbuff 0x7fDGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1a7 sendbuff 0x7f2ac6081000 recvbuff 0x7f1e68408000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a8 sendbuff 0x7fce1e285000 recvbuff 0x7fd1e858b000 count 101191680 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1a9 sDGXDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1a9 sendbuff 0x7f270c000000 recvbuff 0x7f1a54000000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1aa sDGXDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1aa sendbuff 0x7f220e387000 recvbuff 0x7f25d060c000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1ab sDGXH100:2059DGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1ab sendbuff 0x7f2ade285000 recvbuff 0x7f1e68408000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1ac sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1ac sendbuff 0x7fea3a081000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1ad sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1ad sendbuff 0x7fea4c204000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1ae sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1ae sendbuff 0x7fea46183000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1af sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1af sendbuff 0x7fea58306000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b0 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1b0 sendbuff 0x7fea52285000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b1 sendbuff 0x7fDGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1b1 sendbuff 0x7f28ce081000 recvbuff 0x7f1e68408000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b2 sendbuff 0x7fd352387000 recvbuff 0x7fd1e858b000 count 101191680 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b3 sendDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1b3 sendbuff 0x7f2504000000 recvbuff 0x7f1a54000000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b4 sendbuff 0x7fDGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1b4 sendbuff 0x7f28d4102000 recvbuff 0x7f1e38000000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b5 sDGXH100:2059:2059 [3] NDGXH100:2060:2060 [4] NCCL INFO AllGather: opCount 1b5 sendbuff 0x7fa184102000 recvbuff 0x7f946a000000 count 1011916DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b6 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1b6 sendbuff 0x7fe832081000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b7 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1b7 sendbuff 0x7fe844204000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b8 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1b8 sendbuff 0x7fe83e183000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1b9 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1b9 sendbuff 0x7fe850306000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1ba sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1ba sendbuff 0x7fe84a285000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1bb sDGXH100:2059DGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1bb sendbuff 0x7f290a58b000 recvbuff 0x7f1e68408000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1bc sDGXH100:2059DGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1bc sendbuff 0x7f290450a000 recvbuff 0x7f1e38000000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1bd sendbuff 0x7fd15a50a000 recvbuff 0x7fc3ca000000 count 101191680 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1be sendbuff 0x7fDGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1be sendbuff 0x7f291060c000 recvbuff 0x7f1e38000000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1bf sDGXDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1bf sendbuff 0x7f254c60c000 recvbuff 0x7f1a54000000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c0 sendDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1c0 sendbuff 0x7f254658b000 recvbuff 0x7f25d060c000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c1 sDGXDGXH100:2058:2058 [2DGXH100:2060:2060 [4] NCCL INFO AllGather: opCount 1c1 sendbuff 0x7fa1cc70e000 recvbuff 0x7f946a000000 count 1011916DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c2 sendDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1c2 sendbuff 0x7f255268d000 recvbuff 0x7f25d060c000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c3 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1c3 sendbuff 0x7fe88c810000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c4 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1c4 sendbuff 0x7fe88678f000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c5 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1c5 sendbuff 0x7fe898912000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c6 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1c6 sendbuff 0x7fe892891000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c7 sendbuff 0x7fd196a14000 recvbuff 0x7fc3ca000000 count 101191680 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c8 sendbuff 0x7fd190993000 recvbuff 0x7fd1e858b000 count 101191680 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1c9 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1c9 sendbuff 0x7fe8b6081000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1ca sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1ca sendbuff 0x7fe8b0000000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1cb sDGXDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1cb sendbuff 0x7f259a183000 recvbuff 0x7f1a54000000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1cc sDGXDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1cc sendbuff 0x7f2594102000 recvbuff 0x7f25d060c000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1cd sDGXDGXH100:2058:2058 [2] NCCL INFO AllGather: opCount 1cd sendbuff 0x7f25a6285000 recvbuff 0x7f1a54000000 count 101191680 datatype 0 op 0 rDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1ce sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1ce sendbuff 0x7fe8c8204000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1cf sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1cf sendbuff 0x7fe8da387000 recvbuff 0x7fdd70000000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1d0 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1d0 sendbuff 0x7fe8d4306000 recvbuff 0x7fe8f860c000 count 101191680 datatype 0 op 0 rootDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1d1 sendbuff 0x7fd1d6408000 recvbuff 0x7fc3ca000000 count 101191680 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1d2 sendbuff 0x7fd1d0387000 recvbuff 0x7fd1e858b000 count 101191680 datatype 0 op 0 root 0 comm 0x558a79c0d190 [nranks=8] stream 0x558a76b4c9f0
DGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1d3 sendbuff 0x7fDGXH100:2061:2061 [5] NCCL INFO AllGather: opCount 1d3 sendbuff 0x7f299e70e000 recvbuff 0x7f1e68408000 count 101191680 datatypeDGXH100:2062:2062 [6] NCCL INFO AllGather: opCount 1d4 sDGXH100:2059:2059 [3] NCCL INFO AllGather: opCount 1d4 sendbuff 0x7febd4000000 recvbuff 0x7fdc4c000000 count 66592768 datatype 0 op 0 rootDGXH100:2062:5592 [6] NCCL INFO [Service thread] Connection closed by localRankDGXH100:2060:5594 [4] NCCL INFO [Service thread] Connection closed by localRank 3
DGXH100:2060:5594 [4] NCCL INFO [Service thread] Connection closed by localRank 5
DGXDGXH100:2062:2165 [6] NCCL INFO NVLS Unbind MC handle 7fd2dDGXH100:2058:2167 [2] NCCL INFO NVLS Unbind MC handle 7f26b46c33e0 size 1610612736 dev 2
DGXH100:2058:2167 [2] NCCL INFO NVLS Unmap mem UC handle 0x7f26b46c3bd0(0xa20000000) MC handle 0x7f26b46c33e0(0xa80000000)
DGXH100:2058:2167 [2] NCCL INFO comm 0x563ffa9eec60 rank 2 nranks 8 cudaDev 2 busId 52000 - Abort COMPLETE
COMPLETE
t COMPLETE
191680 datatype 0 op 0 root 0 comm 0x55ffaad3d9d0 [nranks=8] stream 0x55ff621148b0
DGXH100:2057:2057 [1] NCCL INFO AllGather: opCount 1d1 sendbuff 0x7f914c60c000 recvbuff 0x7f8608408000 count 101191680 datatype 0 op 0 root 0 comm 0x55ffaad3d9d0 [nranks=8] stream 0x55ff621148b0
DGXH100:2057:2057 [1] NCCL INFO AllGather: opCount 1d2 sendbuff 0x7f914658b000 recvbuff 0x7f85d8000000 count 101191680 datatype 0 op 0 root 0 comm 0x55ffaad3d9d0 [nranks=8] stream 0x55ff621148b0
DGXH100:2057:2057 [1] NCCL INFO AllGather: opCount 1d3 sendbuff 0x7f915870e000 recvbuff 0x7f8608408000 count 101191680 datatype 0 op 0 root 0 comm 0x55ffaad3d9d0 [nranks=8] stream 0x55ff621148b0
DGXH100:2057:2057 [1] NCCL INFO AllGather: opCount 1d4 sendbuff 0x7f941e000000 recvbuff 0x7f84b4000000 count 66592768 datatype 0 op 0 root 0 comm 0x55ffaad3d9d0 [nranks=8] stream 0x55ff621148b0
DGXH100:2057:5598 [1] NCCL INFO [Service thread] Connection closed by localRank 1
DGXH100:2057:5598 [1] NCCL INFO [Service thread] Connection closed by localRank 2

This is the log contained in DEBUG_FILE.

[rank4]:[E ProcessGroupNCCL.cpp:523] [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank7]:[E ProcessGroupNCCL.cpp:523] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank5]:[E ProcessGroupNCCL.cpp:523] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank1]:[E ProcessGroupNCCL.cpp:523] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054175 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank2]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
[rank3]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank3]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank3]:[E ProcessGroupNCCL.cpp:1182] [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feee5debd87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7feee6f936e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7feee6f96c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7feee6f97839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fef30cb6e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fef31fba609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fef31d7b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of ‘c10::DistBackendError’
what(): [Rank 3] NCCL watchdog thread terminated with exception: [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feee5debd87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7feee6f936e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7feee6f96c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7feee6f97839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fef30cb6e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fef31fba609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fef31d7b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7feee5debd87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7feee6cedb11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fef30cb6e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fef31fba609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fef31d7b353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

/opt/conda/envs/develop/lib/python3.10/site-packages/torch/utils/checkpoint.py:460: UserWarning: torch.utils.checkpoint: please pass in use_reentrant=True or use_reentrant=False explicitly. The default value of use_reentrant will be updated to be False in the future. To maintain current behavior, pass use_reentrant=True. It is recommended that you use use_reentrant=False. Refer to docs for more details on the differences between the two variants.
warnings.warn(
[rank4]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank4]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank4]:[E ProcessGroupNCCL.cpp:1182] [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa83b8b8d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fa83ca606e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fa83ca63c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa83ca64839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fa886783e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fa887a87609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa887848353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of ‘c10::DistBackendError’
what(): [Rank 4] NCCL watchdog thread terminated with exception: [Rank 4] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa83b8b8d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fa83ca606e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fa83ca63c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fa83ca64839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fa886783e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fa887a87609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fa887848353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fa83b8b8d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fa83c7bab11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fa886783e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fa887a87609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fa887848353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank5]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank5]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank5]:[E ProcessGroupNCCL.cpp:1182] [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f75160d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2f763086e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2f7630bc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f7630c839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2fc002be95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2fc132f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2fc10f0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of ‘c10::DistBackendError’
what(): [Rank 5] NCCL watchdog thread terminated with exception: [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f75160d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2f763086e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2f7630bc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2f7630c839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2fc002be95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2fc132f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2fc10f0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2f75160d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f2f76062b11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7f2fc002be95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7f2fc132f609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f2fc10f0353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank6]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank6]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank6]:[E ProcessGroupNCCL.cpp:1182] [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd7da026d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd7db1ce6e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd7db1d1c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd7db1d2839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fd824ef1e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fd8261f5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd825fb6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of ‘c10::DistBackendError’
what(): [Rank 6] NCCL watchdog thread terminated with exception: [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd7da026d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7fd7db1ce6e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7fd7db1d1c3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7fd7db1d2839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7fd824ef1e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7fd8261f5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fd825fb6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7fd7da026d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7fd7daf28b11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7fd824ef1e95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7fd8261f5609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fd825fb6353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[rank2]:[E ProcessGroupNCCL.cpp:537] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank2]:[E ProcessGroupNCCL.cpp:543] To avoid data inconsistency, we are taking the entire process down.
[rank2]:[E ProcessGroupNCCL.cpp:1182] [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2bbdda1d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2bbef496e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2bbef4cc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2bbef4d839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2c08c6ce95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2c09f70609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2c09d31353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

terminate called after throwing an instance of ‘c10::DistBackendError’
what(): [Rank 2] NCCL watchdog thread terminated with exception: [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=466, OpType=_ALLGATHER_BASE, NumelIn=25297920, NumelOut=202383360, Timeout(ms)=600000) ran for 1054176 milliseconds before timing out.
Exception raised from checkTimeout at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:525 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2bbdda1d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1e6 (0x7f2bbef496e6 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::workCleanupLoop() + 0x19d (0x7f2bbef4cc3d in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x119 (0x7f2bbef4d839 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #4: + 0xd3e95 (0x7f2c08c6ce95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #5: + 0x8609 (0x7f2c09f70609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f2c09d31353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

Exception raised from ncclCommWatchdog at …/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1186 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x57 (0x7f2bbdda1d87 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libc10.so)
frame #1: + 0xdf6b11 (0x7f2bbeca3b11 in /opt/conda/envs/develop/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so)
frame #2: + 0xd3e95 (0x7f2c08c6ce95 in /opt/conda/envs/develop/bin/…/lib/libstdc++.so.6)
frame #3: + 0x8609 (0x7f2c09f70609 in /usr/lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f2c09d31353 in /usr/lib/x86_64-linux-gnu/libc.so.6)

[2024-03-06 16:30:46,076] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2056 closing signal SIGTERM
[2024-03-06 16:30:46,077] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2057 closing signal SIGTERM
[2024-03-06 16:30:46,078] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 2063 closing signal SIGTERM
[2024-03-06 16:30:47,057] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: -6) local_rank: 2 (pid: 2058) of binary: /opt/conda/envs/develop/bin/python3.10
Traceback (most recent call last):
File “/opt/conda/envs/develop/bin/accelerate”, line 8, in
sys.exit(main())
File “/opt/conda/envs/develop/lib/python3.10/site-packages/accelerate/commands/accelerate_cli.py”, line 47, in main
args.func(args)
File “/opt/conda/envs/develop/lib/python3.10/site-packages/accelerate/commands/launch.py”, line 1010, in launch_command
multi_gpu_launcher(args)
File “/opt/conda/envs/develop/lib/python3.10/site-packages/accelerate/commands/launch.py”, line 672, in multi_gpu_launcher
distrib_run.run(args)
File “/opt/conda/envs/develop/lib/python3.10/site-packages/torch/distributed/run.py”, line 803, in run
elastic_launch(
File “/opt/conda/envs/develop/lib/python3.10/site-packages/torch/distributed/launcher/api.py”, line 135, in call
return launch_agent(self._config, self._entrypoint, list(args))
File “/opt/conda/envs/develop/lib/python3.10/site-packages/torch/distributed/launcher/api.py”, line 268, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:

mllm/pipeline/finetune.py FAILED

And this is the error.