I answered this in another post, *Deadlock when using torch.distributed.broadcast*.
Basically, when you add collectives like broadcast, please make sure they are called on all ranks rather than only on rank 0 (the same way all_reduce is called in the script); that should resolve the issue. Let me know if it doesn't work.
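Here is a minimal sketch of the pattern (the `worker` function, the `gloo` backend, and the port number are illustrative choices, not taken from your script). The key point is that `dist.broadcast` is a collective, so every rank must enter the call, with `src` naming the rank that sends:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp

def worker(rank, world_size):
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

    tensor = torch.zeros(1)
    if rank == 0:
        tensor += 42  # rank 0 holds the value to share

    # WRONG (deadlocks): only rank 0 enters the collective,
    # so the other ranks never match the call.
    # if rank == 0:
    #     dist.broadcast(tensor, src=0)

    # CORRECT: every rank calls broadcast; src=0 means rank 0
    # sends and all other ranks receive into their tensor.
    dist.broadcast(tensor, src=0)

    print(f"rank {rank}: {tensor.item()}")
    dist.destroy_process_group()

if __name__ == "__main__":
    world_size = 2
    mp.spawn(worker, args=(world_size,), nprocs=world_size)
```

The same rule applies to all_reduce, all_gather, and the other collectives: if any rank skips the call, the remaining ranks block waiting for it.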