Why do workers end up with different loss values?

Ah! This is indeed what you meant in DistributedDataParallel loss compute and backpropogation? Pasting my answer from that thread here as well, for posterity and for the indexers.

Each process computes its own output, using its own input and its own activations, and therefore computes its own loss. The loss values differ across workers because each worker sees a different slice of the data and the loss itself is never synchronized. Then, on loss.backward(), all processes all-reduce (average) their gradients. When loss.backward() returns, the gradients of your model parameters are identical across processes, so the optimizer in each process performs the exact same update to the model parameters.
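For illustration, here is a minimal sketch of a single DDP training step; the `train_step` helper and the `loss_fn`, `batch`, and `targets` names are placeholders I'm introducing, not something from the original thread:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def train_step(model: DDP, batch, targets, loss_fn, optimizer):
    optimizer.zero_grad()
    output = model(batch)            # per-rank input -> per-rank activations
    loss = loss_fn(output, targets)  # per-rank loss: values differ across ranks
    loss.backward()                  # DDP all-reduces (averages) gradients here
    # At this point param.grad is identical on every rank.
    optimizer.step()                 # so every rank applies the same update
    return loss.item()               # still the local, unreduced loss value
```

If you want a single loss number for logging, you have to reduce it yourself (e.g. `torch.distributed.all_reduce` on the loss tensor followed by dividing by the world size); DDP only synchronizes gradients, never the loss.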

Note that this is only the case if you use torch.nn.parallel.DistributedDataParallel. If you don’t, you’ll need to take care of gradient synchronization yourself.
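If you do roll your own data parallelism, one common pattern is to all-reduce the gradients yourself after backward() and before optimizer.step(). A rough sketch, assuming the process group is already initialized and every rank holds an identical copy of the model (`sync_gradients` is a hypothetical helper, not part of PyTorch):

```python
import torch.distributed as dist

def sync_gradients(model):
    """Average gradients across all ranks; call after backward(), before step()."""
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum the gradient tensors across ranks, then divide to get the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```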