Full console output (NCCL debug log followed by the timeout error and stack traces):
ddp$ python3 multi.py 50 10
Starting trainer on device ID: 1
Starting trainer on device ID: 0
machine01:124359:124359 [0] NCCL INFO Bootstrap : Using eno1:10.131.27.116<0>
machine01:124359:124359 [0] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
machine01:124359:124359 [0] NCCL INFO cudaDriverVersion 12040
NCCL version 2.20.5+cuda11.8
machine01:124359:124359 [0] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x7fb60f800000
machine01:124360:124360 [1] NCCL INFO cudaDriverVersion 12040
machine01:124360:124360 [1] NCCL INFO Bootstrap : Using eno1:10.131.27.116<0>
machine01:124360:124360 [1] NCCL INFO NET/Plugin : dlerror=libnccl-net.so: cannot open shared object file: No such file or directory No plugin found (libnccl-net.so), using internal implementation
machine01:124360:124360 [1] NCCL INFO init.cc:1732 Cuda Host Alloc Size 4 pointer 0x7f5f79800000
machine01:124360:124432 [1] NCCL INFO NET/IB : No device found.
machine01:124360:124432 [1] NCCL INFO NET/Socket : Using [0]eno1:10.131.27.116<0>
machine01:124360:124432 [1] NCCL INFO Using non-device net plugin version 0
machine01:124360:124432 [1] NCCL INFO Using network Socket
machine01:124359:124431 [0] NCCL INFO NET/IB : No device found.
machine01:124359:124431 [0] NCCL INFO NET/Socket : Using [0]eno1:10.131.27.116<0>
machine01:124359:124431 [0] NCCL INFO Using non-device net plugin version 0
machine01:124359:124431 [0] NCCL INFO Using network Socket
machine01:124360:124432 [1] NCCL INFO comm 0x5623f924e240 rank 1 nranks 2 cudaDev 1 nvmlDev 1 busId 81000 commId 0x81bbd591fef00891 - Init START
machine01:124359:124431 [0] NCCL INFO comm 0x5583e9baa540 rank 0 nranks 2 cudaDev 0 nvmlDev 0 busId 1000 commId 0x81bbd591fef00891 - Init START
machine01:124360:124432 [1] NCCL INFO NET/Socket : GPU Direct RDMA Disabled for HCA 0 'eno1'
machine01:124360:124432 [1] NCCL INFO === System : maxBw 24.0 totalBw 24.0 ===
machine01:124360:124432 [1] NCCL INFO CPU/0 (1/2/-1)
machine01:124360:124432 [1] NCCL INFO + PCI[24.0] - GPU/1000 (0)
machine01:124360:124432 [1] NCCL INFO + PCI[24.0] - GPU/81000 (1)
machine01:124360:124432 [1] NCCL INFO + PCI[0.8] - NIC/C2000
machine01:124360:124432 [1] NCCL INFO ==========================================
machine01:124360:124432 [1] NCCL INFO GPU/1000 :GPU/1000 (0/5000.000000/LOC) GPU/81000 (2/24.000000/PHB) CPU/0 (1/24.000000/PHB)
machine01:124360:124432 [1] NCCL INFO GPU/81000 :GPU/1000 (2/24.000000/PHB) GPU/81000 (0/5000.000000/LOC) CPU/0 (1/24.000000/PHB)
...
machine01:124359:124359 [0] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
machine01:124360:124360 [1] NCCL INFO AllGather: opCount 0 sendbuff 0x7f5f79600400 recvbuff 0x7f5f79600a00 count 8 datatype 0 op 0 root 0 comm 0x5623f924e240 [nranks=2] stream 0x5623f924e170
machine01:124359:124359 [0] NCCL INFO 16 Bytes -> Algo 1 proto 0 time 7.601000
machine01:124360:124360 [1] NCCL INFO misc/utils.cc:235 memory stack hunk malloc(65536)
[rank1]:[E913 22:56:54.397891680 ProcessGroupNCCL.cpp:607] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[rank1]:[E913 22:56:54.398061382 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 1] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E913 22:56:54.398077022 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 1] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank1]:[E913 22:56:54.398083332 ProcessGroupNCCL.cpp:621] [Rank 1] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank1]:[E913 22:56:54.398088912 ProcessGroupNCCL.cpp:627] [Rank 1] To avoid data inconsistency, we are taking the entire process down.
[rank0]:[E913 22:56:54.398202694 ProcessGroupNCCL.cpp:607] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
[rank0]:[E913 22:56:54.398332895 ProcessGroupNCCL.cpp:1664] [PG 0 (default_pg) Rank 0] Exception (either an error or timeout) detected by watchdog at work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E913 22:56:54.398342595 ProcessGroupNCCL.cpp:1709] [PG 0 (default_pg) Rank 0] Timeout at NCCL work: 1, last enqueued NCCL work: 1, last completed NCCL work: -1.
[rank0]:[E913 22:56:54.398348925 ProcessGroupNCCL.cpp:621] [Rank 0] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
[rank0]:[E913 22:56:54.398354025 ProcessGroupNCCL.cpp:627] [Rank 0] To avoid data inconsistency, we are taking the entire process down.
[rank1]:[E913 22:56:54.399640210 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6050cf5f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6051fe2f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6051fe9943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6051febd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b55 (0x7f60b46cfb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f60bc655609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f60bc420353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
[rank0]:[E913 22:56:54.399892952 ProcessGroupNCCL.cpp:1515] [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb6eb265f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb6ec552f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb6ec559943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb6ec55bd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b55 (0x7fb74ec3fb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb756bc5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb756990353 in /lib/x86_64-linux-gnu/libc.so.6)
terminate called after throwing an instance of 'c10::DistBackendError'
what(): [PG 0 (default_pg) Rank 1] Process group watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6050cf5f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7f6051fe2f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7f6051fe9943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7f6051febd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b55 (0x7f60b46cfb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7f60bc655609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7f60bc420353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f6050cf5f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe52446 (0x7f6051c75446 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b55 (0x7f60b46cfb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7f60bc655609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7f60bc420353 in /lib/x86_64-linux-gnu/libc.so.6)
what(): [PG 0 (default_pg) Rank 0] Process group watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLGATHER, NumelIn=1, NumelOut=2, Timeout(ms)=600000) ran for 600018 milliseconds before timing out.
Exception raised from checkTimeout at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:609 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb6eb265f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: c10d::ProcessGroupNCCL::WorkNCCL::checkTimeout(std::optional<std::chrono::duration<long, std::ratio<1l, 1000l> > >) + 0x1d2 (0x7fb6ec552f02 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: c10d::ProcessGroupNCCL::watchdogHandler() + 0x233 (0x7fb6ec559943 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #3: c10d::ProcessGroupNCCL::ncclCommWatchdog() + 0x10c (0x7fb6ec55bd2c in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #4: <unknown function> + 0xd3b55 (0x7fb74ec3fb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #5: <unknown function> + 0x8609 (0x7fb756bc5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #6: clone + 0x43 (0x7fb756990353 in /lib/x86_64-linux-gnu/libc.so.6)
Exception raised from ncclCommWatchdog at /opt/conda/conda-bld/pytorch_1720538621320/work/torch/csrc/distributed/c10d/ProcessGroupNCCL.cpp:1521 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7fb6eb265f86 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0xe52446 (0x7fb6ec1e5446 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/libtorch_cuda.so)
frame #2: <unknown function> + 0xd3b55 (0x7fb74ec3fb55 in /home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/lib/../../../.././libstdc++.so.6)
frame #3: <unknown function> + 0x8609 (0x7fb756bc5609 in /lib/x86_64-linux-gnu/libpthread.so.0)
frame #4: clone + 0x43 (0x7fb756990353 in /lib/x86_64-linux-gnu/libc.so.6)
W0913 22:56:55.349663 140695308830528 torch/multiprocessing/spawn.py:146] Terminating process 124359 via signal SIGTERM
Traceback (most recent call last):
File "/home/alonso/multi.py", line 120, in <module>
mp.spawn(main, args=(world_size, args.save_every, args.total_epochs, args.batch_size), nprocs=world_size)
File "/home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 282, in spawn
return start_processes(fn, args, nprocs, join, daemon, start_method="spawn")
File "/home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 238, in start_processes
while not context.join():
File "/home/alonso/conda/envs/gpu/lib/python3.9/site-packages/torch/multiprocessing/spawn.py", line 170, in join
raise ProcessExitedException(
torch.multiprocessing.spawn.ProcessExitedException: process 1 terminated with signal SIGABRT
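For reference, below is a minimal sketch of the spawn / process-group pattern the traceback points at. Only the names visible in the log (main, world_size, save_every, total_epochs, batch_size, and the mp.spawn call) come from the traceback; the ddp_setup helper, the MASTER_ADDR/MASTER_PORT values, the argument-order mapping of "50 10", and the shortened timeout are assumptions added for illustration, not the contents of the real multi.py.

```python
# Minimal sketch, NOT the actual multi.py: only the names seen in the traceback
# (main, world_size, save_every, total_epochs, batch_size, mp.spawn) are taken
# from the log; everything else is an assumption for illustration.
import os
import sys
from datetime import timedelta

import torch
import torch.multiprocessing as mp
from torch.distributed import destroy_process_group, init_process_group


def ddp_setup(rank: int, world_size: int) -> None:
    """Single-node rendezvous using the default env:// init method."""
    os.environ.setdefault("MASTER_ADDR", "localhost")  # assumed single-node setup
    os.environ.setdefault("MASTER_PORT", "12355")      # hypothetical free port
    torch.cuda.set_device(rank)
    # The default collective timeout is 10 minutes (the 600000 ms in the watchdog
    # message above); a shorter value makes a hung init fail fast while debugging.
    init_process_group(
        backend="nccl",
        rank=rank,
        world_size=world_size,
        timeout=timedelta(minutes=2),
    )


def main(rank: int, world_size: int, save_every: int,
         total_epochs: int, batch_size: int) -> None:
    # mp.spawn passes the process index as the first argument; on a single node
    # it doubles as the local GPU / rank ("device ID" in the log above).
    print(f"Starting trainer on device ID: {rank}")
    ddp_setup(rank, world_size)
    # ... model construction, DistributedDataParallel wrapping, training loop ...
    destroy_process_group()


if __name__ == "__main__":
    # Hypothetical mapping of the two command-line values ("50 10") to arguments.
    total_epochs, save_every = int(sys.argv[1]), int(sys.argv[2])
    world_size = torch.cuda.device_count()
    mp.spawn(main, args=(world_size, save_every, total_epochs, 32),
             nprocs=world_size)
```

In this pattern the very first NCCL collective (the AllGather with SeqNum=1 in the watchdog message) is issued during process-group/DDP initialization, so a timeout there means the two ranks never completed their initial handshake rather than failing partway through training.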