The servers I run code on recently migrated to Ubuntu 24.04 (with NVIDIA driver 565.57.01), and since then I get a seemingly random timeout on all-reduce calls. With INFO environment variables set (roughly as in the sketch that follows), the resulting output is further below; this particular run completed 25 epochs before failing. This was under torch nightly (2.7.0+cu126), but the problem persists with torch 2.5.1+cu124 (the logs for that run are still being produced). What am I doing wrong? This is the same code I ran without issue on Ubuntu 22.04.
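For reference, the logging variables were along these lines (a minimal sketch; NCCL_DEBUG=INFO is what produces the NCCL INFO lines below, and the other two switches are my best recollection of the run configuration, so treat them as an assumption):

```python
import os

# Logging knobs, set before torch.distributed / NCCL initialization.
# NCCL_DEBUG=INFO produces the "NCCL INFO" lines in the output below;
# the other two are standard PyTorch distributed logging switches and
# reflect my best recollection of how the failing run was launched.
os.environ["NCCL_DEBUG"] = "INFO"
os.environ["TORCH_CPP_LOG_LEVEL"] = "INFO"
os.environ["TORCH_DISTRIBUTED_DEBUG"] = "INFO"
```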
$device:64154:64154 [0] NCCL INFO AllReduce: opCount 10ecf sendbuff 0x7b069cd5b000 recvbuff 0x7b069cd5b000 count 1 datatype 7 op 0 root 0 comm 0x46044020 [nranks=4] stream 0x4625c980
$device:64154:64154 [0] NCCL INFO 4 Bytes -> Algo 1 proto 0 time 12.602401
[rank3]:[E114 22:31:00.813063304 ProcessGroupNCCL.cpp:633] [Rank 3] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69328, OpType=ALLREDUCE, NumelIn=2365450, NumelOut=2365450, Timeout(ms)=600000) ran for 600003 milliseconds before timing out.
[rank3]:[E114 22:31:00.813861279 ProcessGroupNCCL.cpp:2170] [PG ID 0 PG GUID 0(default_pg) Rank 3] failure detected by watchdog at work sequence id: 69328 PG status: last enqueued work: 69331, last completed work: 69327
[rank3]:[E114 22:31:00.813915780 ProcessGroupNCCL.cpp:671] Stack trace of the failed collective not found, potentially because FlightRecorder is disabled. You can enable it by setting TORCH_NCCL_TRACE_BUFFER_SIZE to a non-zero value.
[rank2]:[E114 22:31:00.850481621 ProcessGroupNCCL.cpp:633] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=69328, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=600000) ran for 600032 milliseconds before timing out.
[rank2]:[E114 22:31:00.851808890 ProcessGroupNCCL.cpp:2170] [PG ID 0 PG GUID 0(default_pg) Rank 2] failure detected by watchdog at work sequence id: 69328 PG status: last enqueued work: 69328, last completed work: 69327
[rank2]:[E114 22:31:00.852806006 ProcessGroupNCCL.cpp:668] Stack trace of the failed collective:
#0 barrier from /path/to/torch/distributed/distributed_c10d.py:4551
#1 wrapper from /path/to/torch/distributed/c10d_logger.py:81
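Following the suggestion in the rank 3 message above, I am planning to rerun with the FlightRecorder enabled so the next timeout captures the stack trace of the failed collective. A minimal sketch of what I intend to do (the buffer size of 2000 is an arbitrary choice on my part):

```python
import os

# Enable PyTorch's NCCL FlightRecorder ring buffer, as suggested by the
# ProcessGroupNCCL error above ("setting TORCH_NCCL_TRACE_BUFFER_SIZE to a
# non-zero value"). Must be set before the NCCL process group is created.
os.environ["TORCH_NCCL_TRACE_BUFFER_SIZE"] = "2000"

import torch.distributed as dist

# ...then initialize the process group as usual.
dist.init_process_group(backend="nccl")
```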