Training hangs on `loss.backward()` with DDP (`--nnodes=2 --nproc_per_node=3`)

Have you verified that every rank processes the same number of batches, as described here? DDP's gradient allreduce during `loss.backward()` is a collective operation, so if one rank exhausts its data early (e.g. because the dataset doesn't split evenly across your 6 ranks), the remaining ranks block in the allreduce indefinitely.
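A minimal sketch of how the mismatch can arise: naive per-rank slicing of a dataset whose size doesn't divide evenly across ranks yields different batch counts, while `DistributedSampler` pads (or, with `drop_last=True`, truncates) so every rank sees the same count. The dataset size and batch size below are assumptions for illustration; the sampler is constructed with explicit `num_replicas`/`rank` so this runs without initializing a process group:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Hypothetical setup: 100 samples across 6 ranks (2 nodes x 3 procs).
dataset = TensorDataset(torch.arange(100).float())
world_size = 6
batch_size = 4

# Naive per-rank slicing: ranks end up with unequal batch counts.
naive_counts = []
for rank in range(world_size):
    shard = list(range(rank, len(dataset), world_size))  # this rank's samples
    naive_counts.append(-(-len(shard) // batch_size))    # ceil division
print("naive batch counts per rank:", naive_counts)

# DistributedSampler pads the dataset so every rank gets the same
# number of samples, hence the same number of batches.
sampler_counts = []
for rank in range(world_size):
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    sampler_counts.append(len(loader))
print("sampler batch counts per rank:", sampler_counts)
```

With the naive split, four ranks run one more batch than the other two, which is exactly the situation that leaves `loss.backward()` hanging. If you genuinely can't equalize inputs, PyTorch's `torch.distributed.algorithms.join.Join` context manager is the documented escape hatch for uneven inputs under DDP.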