Training hangs on `loss.backward()` with DDP (`--nnodes=2 --nproc_per_node=3`)

Have you verified that every rank processes the same number of batches, as described here? DDP's gradient allreduce during `loss.backward()` is a collective operation, so if one rank exhausts its data early (e.g. because the dataset doesn't split evenly across your 6 ranks), the remaining ranks block in the allreduce indefinitely.
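A minimal sketch of how the mismatch can arise: naive per-rank slicing of a dataset whose size doesn't divide evenly across ranks yields different batch counts, while `DistributedSampler` pads (or, with `drop_last=True`, truncates) so every rank sees the same count. The dataset size and batch size below are assumptions for illustration; the sampler is constructed with explicit `num_replicas`/`rank` so this runs without initializing a process group:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler

# Hypothetical setup: 100 samples across 6 ranks (2 nodes x 3 procs).
dataset = TensorDataset(torch.arange(100).float())
world_size = 6
batch_size = 4

# Naive per-rank slicing: ranks end up with unequal batch counts.
naive_counts = []
for rank in range(world_size):
    shard = list(range(rank, len(dataset), world_size))  # this rank's samples
    naive_counts.append(-(-len(shard) // batch_size))    # ceil division
print("naive batch counts per rank:", naive_counts)

# DistributedSampler pads the dataset so every rank gets the same
# number of samples, hence the same number of batches.
sampler_counts = []
for rank in range(world_size):
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=batch_size, sampler=sampler)
    sampler_counts.append(len(loader))
print("sampler batch counts per rank:", sampler_counts)
```

With the naive split, four ranks run one more batch than the other two, which is exactly the situation that leaves `loss.backward()` hanging. If you genuinely can't equalize inputs, PyTorch's `torch.distributed.algorithms.join.Join` context manager is the documented escape hatch for uneven inputs under DDP.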