I am using two GPU machines with 2 TB of memory in total. I installed the environment according to the guide "Shocking Release! DeepSeek 671B Fine-Tuning Guide Revealed—Unlock the Upgraded DeepSeek Suite with One Click, AI Players Ecstatic!"
Then I installed the ColossalAI components and tested ColossalAI with examples/language/llama/benchmark.py using this command:
colossalai run --nproc_per_node 8 --host 192.168.112.69,192.168.112.61 --master_addr 192.168.112.69 benchmark.py -g -x -b 6
Next I tested nccl-tests with this command:
all_reduce_perf -b 1M -e 32M -f 2 -g 8 -c 0
Both tests seemed to work well.
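Note that the all_reduce_perf command above runs in a single process with 8 GPUs, so it only exercises the links inside one node. A two-node run of the same test would look roughly like the sketch below, assuming nccl-tests is built with MPI support (the binary path and mpirun flags are illustrative, not what I actually ran):
mpirun -np 16 -H 192.168.112.69:8,192.168.112.61:8 -x NCCL_IB_GID_INDEX=3 -x NCCL_IB_DISABLE=0 -x NCCL_DEBUG=INFO ./build/all_reduce_perf -b 1M -e 32M -f 2 -g 1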
Then I ran lora_finetune.py via torchrun with this command:
TORCH_LOGS="+all" TORCH_NCCL_TRACE_BUFFER_SIZE=2000 TORCH_NCCL_DUMP_ON_TIMEOUT=true TORCH_NCCL_DEBUG_INFO_TEMP_FILE=/ssd1/nccl_trace_rank_ TORCH_NCCL_TRACE_CPP_STACK=true TORCH_NCCL_ENABLE_TIMING=true NCCL_IB_GID_INDEX=3 NCCL_IB_DISABLE=0 NCCL_DEBUG_SUBSYS=INIT,ENV,NET,GRAPH NCCL_DEBUG=INFO NCCL_DEBUG_FILE=/ssd1/log/nccl.log CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 /usr/bin/python3 /usr/local/bin/torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0 --master_addr=192.168.112.69 --master_port=29500 lora_finetune.py --pretrained /ssd1/model/DeepSeek-R1-bf16/ --dataset /ssd1/dataset/lora_sft_data.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 2 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir /ssd1/model/DeepSeek-R1-bf16-lora
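The second machine is launched with the same command except --node_rank=1 (abbreviated sketch below; the NCCL/torch environment variables are the same and the paths are assumed identical on both nodes):
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 /usr/bin/python3 /usr/local/bin/torchrun --nproc_per_node=8 --nnodes=2 --node_rank=1 --master_addr=192.168.112.69 --master_port=29500 lora_finetune.py --pretrained /ssd1/model/DeepSeek-R1-bf16/ --dataset /ssd1/dataset/lora_sft_data.jsonl --plugin moe --lr 2e-5 --max_length 256 -g --ep 8 --pp 2 --batch_size 24 --lora_rank 8 --lora_alpha 16 --num_epochs 2 --warmup_steps 8 --tensorboard_dir logs --save_dir /ssd1/model/DeepSeek-R1-bf16-lora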
The model weights were converted with fp8_cast_bf16.py, and the dataset is lora_sft_data.jsonl.
When I run the script, I get a message like this:
[Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600036 milliseconds before timing out.
The full output of torchrun:
warnings.warn(
/usr/local/lib/python3.10/dist-packages/colossalai/shardformer/layer/normalization.py:339: UserWarning: Module replacement failed. Please install apex from source (GitHub - NVIDIA/apex: A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch) to use the fused RMS normalization kernel
warnings.warn(
Step: 0it [00:00, ?it/s]
Step: 0it [00:00, ?it/s]
[rank15]:I0319 18:41:39.339000 3662096 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank8]:I0319 18:42:10.698000 3662089 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank9]:I0319 18:43:38.481000 3662090 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank14]:I0319 18:43:43.324000 3662095 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank10]:I0319 18:45:50.405000 3662091 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank13]:I0319 18:45:54.202000 3662094 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank12]:I0319 18:47:21.447000 3662093 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank11]:I0319 18:47:57.059000 3662092 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank11]:I0319 18:48:00.937000 3662092 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank12]:I0319 18:48:01.107000 3662093 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank9]:I0319 18:48:01.114000 3662090 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank10]:I0319 18:48:01.124000 3662091 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank15]:I0319 18:48:01.136000 3662096 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank13]:I0319 18:48:01.137000 3662094 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank14]:I0319 18:48:01.139000 3662095 torch/distributed/distributed_c10d.py:815] Using device cuda for object collectives.
[rank15]:[E319 18:56:07.313305267 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 15] Observed flight recorder dump signal from another rank via TCPStore.
[rank15]:[E319 18:56:07.313490064 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 15] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank15]:[E319 18:56:07.313685929 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 15] ProcessGroupNCCL preparing to dump debug info.
[rank12]:[E319 18:56:07.382927380 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 12] Observed flight recorder dump signal from another rank via TCPStore.
[rank12]:[E319 18:56:07.383084439 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 12] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank12]:[E319 18:56:07.383248286 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 12] ProcessGroupNCCL preparing to dump debug info.
[rank13]:[E319 18:56:08.623856716 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 13] Observed flight recorder dump signal from another rank via TCPStore.
[rank13]:[E319 18:56:08.624036137 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 13] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank13]:[E319 18:56:08.624235262 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 13] ProcessGroupNCCL preparing to dump debug info.
[rank14]:[E319 18:56:08.654622039 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 14] Observed flight recorder dump signal from another rank via TCPStore.
[rank14]:[E319 18:56:08.654784092 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 14] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank14]:[E319 18:56:08.654973284 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 14] ProcessGroupNCCL preparing to dump debug info.
[rank11]:[E319 18:56:08.695592459 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 11] Observed flight recorder dump signal from another rank via TCPStore.
[rank11]:[E319 18:56:08.695823267 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 11] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank11]:[E319 18:56:08.695998310 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 11] ProcessGroupNCCL preparing to dump debug info.
[rank10]:[E319 18:56:08.705166056 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 10] Observed flight recorder dump signal from another rank via TCPStore.
[rank10]:[E319 18:56:08.705384613 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 10] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank10]:[E319 18:56:08.705548805 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 10] ProcessGroupNCCL preparing to dump debug info.
[rank8]:[E319 18:56:08.713736005 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 8] Observed flight recorder dump signal from another rank via TCPStore.
[rank8]:[E319 18:56:08.713962485 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 8] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank8]:[E319 18:56:08.714134674 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 8] ProcessGroupNCCL preparing to dump debug info.
[rank9]:[E319 18:56:08.755776177 ProcessGroupNCCL.cpp:1423] [PG ID 0 PG GUID 0 Rank 9] Observed flight recorder dump signal from another rank via TCPStore.
[rank9]:[E319 18:56:08.755926763 ProcessGroupNCCL.cpp:1484] [PG ID 0 PG GUID 0 Rank 9] Received a dump signal due to a collective timeout from rank 6 and we will try our best to dump the debug info. Last enqueued NCCL work: -1, last completed NCCL work: -1.This is most likely caused by incorrect usages of collectives, e.g., wrong sizes used across ranks, the order of collectives is not same for all ranks or the scheduled collective, for some reason, didn’t run. Additionally, this can be caused by GIL deadlock or other reasons such as network errors or bugs in the communications library (e.g. NCCL), etc.
[rank9]:[E319 18:56:08.756072031 ProcessGroupNCCL.cpp:1288] [PG ID 0 PG GUID 0 Rank 9] ProcessGroupNCCL preparing to dump debug info.
[rank13]:[E319 18:58:00.826432549 ProcessGroupNCCL.cpp:616] [Rank 5] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600064 milliseconds before timing out.
[rank13]:[E319 18:58:00.826527099 ProcessGroupNCCL.cpp:1785] [PG ID 3 PG GUID 26 Rank 5] Exception (either an error or timeout) detected by watchdog at work: 3, last enqueued NCCL work: 3, last completed NCCL work: 2.
[rank15]:[E319 18:58:00.841515160 ProcessGroupNCCL.cpp:616] [Rank 7] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600079 milliseconds before timing out.
[rank15]:[E319 18:58:00.841590933 ProcessGroupNCCL.cpp:1785] [PG ID 3 PG GUID 26 Rank 7] Exception (either an error or timeout) detected by watchdog at work: 3, last enqueued NCCL work: 3, last completed NCCL work: 2.
[rank14]:[E319 18:58:00.853505553 ProcessGroupNCCL.cpp:616] [Rank 6] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3, OpType=COALESCED, NumelIn=18446744073709551615, NumelOut=18446744073709551615, Timeout(ms)=600000) ran for 600091 milliseconds before timing out.
[rank14]:[E319 18:58:00.853608228 ProcessGroupNCCL.cpp:1785] [PG ID 3 PG GUID 26 Rank 6] Exception (either an error or timeout) detected by watchdog at work: 3, last enqueued NCCL work: 3, last completed NCCL work: 2.
[rank11]:[E319 18:59:02.335480159 ProcessGroupNCCL.cpp:1154] [PG ID 3 PG GUID 26 Rank 3] Future for ProcessGroup abort timed out after 600000 ms
[rank10]:[E319 18:59:17.864130764 ProcessGroupNCCL.cpp:1154] [PG ID 3 PG GUID 26 Rank 2] Future for ProcessGroup abort timed out after 600000 ms
[rank12]:[E319 18:59:18.536762189 ProcessGroupNCCL.cpp:1154] [PG ID 3 PG GUID 26 Rank 4] Future for ProcessGroup abort timed out after 600000 ms
[rank9]:[E319 18:59:18.538266323 ProcessGroupNCCL.cpp:1154] [PG ID 3 PG GUID 26 Rank 1] Future for ProcessGroup abort timed out after 600000 ms
My environment:
- torch: 2.5.1
- NCCL: 2.25.1
- Python: 3.10.12
- CUDA: 12.4
- ColossalAI: v0.4.9
- Driver: 550.144.03
I tried using perf and py-spy to get stack traces, but they only show many stacks related to multiprocessing.
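For reference, the per-process dump I mean looks roughly like this (the PID is a placeholder for one of the hung torchrun worker processes); the gdb variant is only a sketch for capturing native (NCCL/C++) frames, which the default py-spy dump does not show:
py-spy dump --pid <worker_pid>
gdb -p <worker_pid> -batch -ex "thread apply all bt"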
I also used the flight recorder to dump info for the ranks; the result looks like this:
python3 ./flight_recorder/fr_trace.py --prefix "nccl_trace_rank_" /ssd1/torchtrace/
loaded 16 files in 0.07902264595031738s
built groups, memberships
Traceback (most recent call last):
  File "/ssd1/pytorch/tools/./flight_recorder/fr_trace.py", line 51, in <module>
    main()
  File "/ssd1/pytorch/tools/./flight_recorder/fr_trace.py", line 44, in main
    db = build_db(details, args, version)
  File "/ssd1/pytorch/tools/flight_recorder/components/builder.py", line 444, in build_db
    tracebacks, collectives, nccl_calls = build_collectives(
  File "/ssd1/pytorch/tools/flight_recorder/components/builder.py", line 219, in build_collectives
    if find_coalesced_group(pg_name, entries, _pg_guids, first_rank):
  File "/ssd1/pytorch/tools/flight_recorder/components/utils.py", line 208, in find_coalesced_group
    assert found[-1][1]["profiling_name"] == "nccl:coalesced"
AssertionError
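As a workaround for this assertion, the raw dump files can also be inspected directly. A sketch, assuming each dump is a plain pickle containing an "entries" list (which is what fr_trace.py itself reads) and that the rank-6 file sits under the same directory and prefix as above:
python3 -c "import pickle, sys; d = pickle.load(open(sys.argv[1], 'rb')); print(d.keys()); print(d['entries'][-1])" /ssd1/torchtrace/nccl_trace_rank_6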
There are 16 ranks, but only rank 0 and rank 7 have output, like this:
python3 ./flight_recorder/fr_trace.py --prefix "nccl_trace_rank_" /ssd1/torchtrace/ -j
loaded 16 files in 0.07946634292602539s
built groups, memberships
Rank 0 Rank 1 Rank 2 Rank 3 Rank 4 Rank 5 Rank 6 Rank 7 Rank 8 Rank 9 Rank 10 Rank 11 Rank 12 Rank 13 Rank 14 Rank 15
recv(s=6 d=0, sz=[[7168, 8]], state=scheduled) recv(s=13 d=8, sz=[[8, 7168]], state=scheduled)
recv(s=6 d=0, sz=[[8, 7168]], state=scheduled) recv(s=13 d=8, sz=[[2048, 8]], state=scheduled)
…
recv(s=6 d=0, sz=[[8, 2048]], state=scheduled)
recv(s=6 d=0, sz=[[7168, 8]], state=scheduled)
coalesced(input_sizes=None, state=completed)