Gradient accumulation gives different results compared to full batch

I’m building an image captioner using Hugging Face models with PyTorch, and I’m getting different results for the first iteration (and for the following iterations, obviously) even when the effective batch size is the same (a mini batch of 4 with 16 gradient accumulation steps vs a batch size of 64, or any other combination that gives an effective batch size of 64).
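For context, the accumulation pattern I’m comparing against a single big batch looks roughly like this (a simplified, self-contained sketch with a toy model, not the actual gist code):

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Linear(10, 1)                    # toy stand-in for the real captioner
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.MSELoss()                    # mean reduction by default

x = torch.randn(64, 10)                     # one "effective batch" of 64 samples
y = torch.randn(64, 1)

accumulation_steps = 16                     # mini batch of 4 -> effective batch of 64
optimizer.zero_grad()
for chunk_x, chunk_y in zip(x.chunk(accumulation_steps), y.chunk(accumulation_steps)):
    loss = criterion(model(chunk_x), chunk_y)
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient matches one big batch
optimizer.step()
```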

For a batch size of 16 with 4 gradient accumulation steps, the loss of the first iteration is 4.095348954200745, and for a batch size of 4 with 16 gradient accumulation steps it is 4.097771629691124.

Here’s a gist with the code I’m running: https://gist.github.com/miguelscarv/55c7ce58c911a743fd54be258e3e0d9c

I’ve also included the vision_encoder_decoder.py code because I had to make some slight modifications for it to accept CLIP models.

I’ve disabled dropout, set the seed, didn’t shuffle the dataloader, and as far as I know the models I’m using don’t have batch-dependent layers like batch norm. The models I’m using are gpt2 and laion/CLIP-ViT-L-14-DataComp.XL-s13B-b90K (in case you want to look at them). I also don’t think it’s a precision issue, because in the code I compare the sum of the losses divided by the number of grad acc steps against the losses each divided by the grad acc steps and then summed, and they give the exact same result…
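(The precision check I mean is roughly this, with made-up loss values:)

```python
import torch

grad_acc_steps = 16
step_losses = torch.rand(grad_acc_steps) * 4                    # made-up per-step loss values

summed_then_divided = step_losses.sum() / grad_acc_steps
divided_then_summed = (step_losses / grad_acc_steps).sum()
print(summed_then_divided.item(), divided_then_summed.item())   # identical, or off only in the last float bits
```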

In the same gist I added an attempt at recreating this issue with pure PyTorch and the MNIST dataset, so you can run it more easily. This is the torch_test.py file. The loss difference isn’t as large as the one I just described, but it’s still a difference that shouldn’t be there in my opinion, since there are no precision issues (I think).

Do you see the same behavior with other models?

No, at least not to this extent. Changing the gradient accumulation configuration while keeping the effective batch size the same drastically changes the loss of the first iteration (the first full effective batch). Generally speaking, the more gradient accumulation steps are performed, the bigger the initial loss for the same effective batch size. This is for the text_caps and vision_encoder_decoder models. For the pure PyTorch one there are some very slight changes, but nothing major.

Please, if possible, take a look at the gist and see if anything stands out to you in the text_caps file or in the vision_encoder_decoder file.

I think I figured it out. Essentially, the “problem” was that I was using mean reduction in my loss while training a model with variable sequence lengths. If I have 2 sequences, A and B, where sequence A has 7 tokens and sequence B has 10 tokens, then I have to add 3 padding tokens to A. With both sequences in one batch, the mean-reduced loss would be (loss_A + loss_B)/17, where loss_A and loss_B are the summed per-token losses. With gradient accumulation and mean reduction, each mini batch is averaged over its own token count, giving loss_A/7 + loss_B/10 (divided by the number of accumulation steps). It’s easy to tell that these two are not the same.
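Here’s a tiny sketch of the mismatch with made-up per-token losses (loss_A and loss_B above are the summed per-token losses of each sequence):

```python
import torch

torch.manual_seed(0)
per_token_loss_A = torch.rand(7)     # 7 real tokens in sequence A
per_token_loss_B = torch.rand(10)    # 10 real tokens in sequence B (padding ignored)

# One full batch, mean reduction over all 17 non-padded tokens:
full_batch_loss = (per_token_loss_A.sum() + per_token_loss_B.sum()) / 17

# Gradient accumulation: each mini batch is mean-reduced over its own token count,
# then divided by the number of accumulation steps:
accumulated_loss = (per_token_loss_A.mean() + per_token_loss_B.mean()) / 2

print(full_batch_loss.item(), accumulated_loss.item())   # not equal in general
```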

To avoid this issue, would it make sense to change the reduction to sum and only divide the loss by the number of (non-padded) tokens in the effective batch at the end? I think this would more closely (and by that I mean exactly, up to precision errors) reflect the “real” loss of the effective batch, right?
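Something like this is what I have in mind, sketched with a toy model and CrossEntropyLoss (the padding id, shapes, and names are placeholders, not the gist’s actual code):

```python
import torch
from torch import nn

torch.manual_seed(0)
VOCAB, PAD_ID = 50, 0                      # hypothetical vocab size and padding id

model = nn.Linear(8, VOCAB)                # toy stand-in for the captioner's decoder
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
criterion = nn.CrossEntropyLoss(reduction="sum", ignore_index=PAD_ID)

# Two mini batches that together form one effective batch (padded to equal length).
mini_batches = [
    (torch.randn(4, 12, 8), torch.randint(1, VOCAB, (4, 12))),
    (torch.randn(4, 12, 8), torch.randint(1, VOCAB, (4, 12))),
]
mini_batches[0][1][:, 9:] = PAD_ID         # pretend some sequences are shorter

# Count the non-padded tokens of the whole effective batch up front.
total_tokens = sum((labels != PAD_ID).sum() for _, labels in mini_batches)

optimizer.zero_grad()
for inputs, labels in mini_batches:
    logits = model(inputs)                              # (batch, seq_len, vocab)
    loss = criterion(logits.transpose(1, 2), labels)    # summed over non-padded tokens
    (loss / total_tokens).backward()                    # normalize by the effective batch's token count
optimizer.step()
```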

You mentioned that you set the random seed, which is good practice for reproducibility. However, PyTorch operations might not be fully deterministic, especially on GPU. To maximize determinism, you can also set the following flags at the top of your script:
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
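For completeness, a fuller reproducibility setup might look roughly like this (a sketch; the CUBLAS_WORKSPACE_CONFIG value is the one recommended in PyTorch’s reproducibility notes):

```python
import os
import random

import numpy as np
import torch

# Must be set before any CUDA work for deterministic cuBLAS (see the PyTorch reproducibility docs).
os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

SEED = 0
random.seed(SEED)
np.random.seed(SEED)
torch.manual_seed(SEED)
torch.cuda.manual_seed_all(SEED)

torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False
torch.use_deterministic_algorithms(True)  # error out on nondeterministic ops instead of silently differing
```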