DDP no_sync not equivalent

Hi there,

I just wanted to check my logic. When using gradient accumulation, the gradients for my model under DDP with no_sync aren't identical to the gradients under DDP without no_sync. The difference shows up in the 2nd decimal place of the gradient norm, which is quite small, so my guess is that this is a floating-point limitation: the gradients are now summed in a slightly different order, and since floating-point addition isn't associative, the results differ slightly.
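To illustrate what I mean by summation order, here is a minimal pure-Python sketch. It is not my training code and does not call DDP at all: the `f32` and `f32_sum` helpers, the `world_size` and `accum_steps` values, and the random "gradients" are all made up for illustration, and it uses plain sums where DDP would average across ranks. It only emulates float32 rounding and compares the two accumulation orders.

```python
import random
import struct


def f32(x):
    """Round a Python float (float64) to the nearest float32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def f32_sum(values):
    """Sum left to right, rounding to float32 after every addition."""
    total = 0.0
    for v in values:
        total = f32(total + f32(v))
    return total


# Same three numbers, different order, different float32 result:
# 1e8 + 1.0 rounds back to 1e8 because float32 spacing near 1e8 is 8.
order_a = f32_sum([1e8, 1.0, -1e8])
order_b = f32_sum([1e8, -1e8, 1.0])

# Hypothetical setup: one scalar "gradient" per rank per micro-batch.
random.seed(0)
world_size, accum_steps = 4, 8
grads = [
    [f32(random.gauss(0.0, 1.0)) for _ in range(accum_steps)]
    for _ in range(world_size)
]

# Roughly what no_sync does: accumulate locally on each rank,
# then reduce across ranks once at the end.
local_then_reduce = f32_sum(f32_sum(rank_grads) for rank_grads in grads)

# Roughly what happens without no_sync: reduce across ranks on every
# micro-batch, then accumulate the reduced values.
reduce_then_accumulate = f32_sum(
    f32_sum(grads[rank][step] for rank in range(world_size))
    for step in range(accum_steps)
)

print("order_a:", order_a, "order_b:", order_b)
print("local then reduce:     ", repr(local_then_reduce))
print("reduce then accumulate:", repr(reduce_then_accumulate))
print("difference:", local_then_reduce - reduce_then_accumulate)
```

The two orderings are mathematically identical, so any gap printed on the last line comes purely from rounding. Whether that is enough to account for a difference in the 2nd decimal place of a gradient norm presumably depends on the size of the norm and the dtype in use.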

Does this make sense as an explanation? Is there anything else that I could be missing?

Thanks!

This seems somewhat reasonable to me. Is this using fp32 gradients?