Hi there,
Just wanted to check some logic. I'm seeing cases where the gradients for my model under DDP with `no_sync` aren't equivalent to the gradients under DDP without `no_sync` when using gradient accumulation. Roughly, these are the two patterns I'm comparing:
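(A minimal sketch of my setup; `model` is the DDP-wrapped module, and `micro_batches`, `compute_loss`, and `optimizer` are placeholders for my actual training code.)

```python
import contextlib

# Pattern A: no_sync on all but the last micro-batch, so gradients
# accumulate locally and are all-reduced once on the final backward().
model.zero_grad()
for i, batch in enumerate(micro_batches):
    ctx = model.no_sync() if i < len(micro_batches) - 1 else contextlib.nullcontext()
    with ctx:
        loss = compute_loss(model(batch)) / len(micro_batches)
        loss.backward()
optimizer.step()

# Pattern B: plain DDP, so every backward() triggers an all-reduce and
# the already-reduced gradients accumulate in .grad.
model.zero_grad()
for batch in micro_batches:
    loss = compute_loss(model(batch)) / len(micro_batches)
    loss.backward()
optimizer.step()
```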
The discrepancy shows up in the 2nd decimal place of the gradient norm, which is obviously quite small. My guess is that we're running into the limits of floating point: with `no_sync`, local gradients are accumulated across micro-batches first and all-reduced once at the end, whereas without it each micro-batch's gradients are all-reduced and then accumulated. Since floating-point addition isn't associative, summing in a different order gives slightly different results, as in the toy example below.
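Here's a toy illustration of the ordering effect I have in mind (just random tensors standing in for gradients, no distributed setup involved):

```python
import torch

torch.manual_seed(0)
n_ranks, n_micro, n_params = 2, 4, 1_000_000

# Stand-in per-rank, per-micro-batch gradients (random, not from a real model).
grads = torch.randn(n_ranks, n_micro, n_params)

# no_sync order: sum micro-batches locally first, then "all-reduce" across ranks.
no_sync_total = grads.sum(dim=1).sum(dim=0)

# Default DDP order: "all-reduce" each micro-batch across ranks, then accumulate.
default_total = grads.sum(dim=0).sum(dim=0)

# Mathematically identical sums, but float32 addition isn't associative,
# so the two orderings typically differ by a small amount.
print((no_sync_total - default_total).abs().max())
print(no_sync_total.norm().item(), default_total.norm().item())
```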
Does this make sense as an explanation? Is there anything else that I could be missing?
Thanks!