Gradient accumulation causes worse generalization

I notice that when I accumulate gradients, there is consistently worse generalization (by a small amount), e.g. perplexity 12.24 vs 12.02.

This happens especially when combining it with checkpointing; I did not notice that checkpointing alone harms generalization for my models.

I checked this for several tasks. I tried it in both deterministic and non-deterministic cuDNN modes, and I tried it with LayerNorm to avoid batch norm problems.

I wonder if it could be about the LayerNorm epsilon; maybe I should change it when I scale the loss for gradient accumulation? Or maybe it's an overflow?

Any suggestions would be appreciated

Hi,

One thing to check with checkpointing is that at least one input to the module needs to require gradients. So if it is the first module in your net, and your input does not require gradient, it can prevent the first layer from training. Setting the input to require gradient is a simple workaround.
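For example, a minimal sketch of that workaround (the model and shapes here are made up for illustration):

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# Toy model just for illustration.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.randn(8, 16)   # a plain data tensor does not require grad
x.requires_grad_()       # workaround: give the checkpointed segment an input that requires grad
out = checkpoint(model, x)
out.sum().backward()     # gradients now flow back into the first Linear layer
```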

I notice that when I accumulate gradients, there is consistently worse generalization

What do you mean by accumulating gradients? Do you mean using .backward() vs autograd.grad()?

But as long as there is some nn.Parameter, the result will require grad, so I'm not sure why that is necessary?

By accumulating gradients I mean several forward and backward passes without calling zero_grad between them. The gradients are accumulated in the .grad field. It's used, for example, to simulate bigger batches (together with loss scaling).
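To be concrete, my loop looks roughly like this (toy model, data, and step count, just for illustration):

```python
import torch
import torch.nn as nn

# Toy setup so the snippet runs on its own; the names and sizes are made up.
model = nn.Linear(16, 1)
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = [(torch.randn(8, 16), torch.randn(8, 1)) for _ in range(8)]

accumulation_steps = 4  # simulate a batch 4x bigger than what fits in memory

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = criterion(model(inputs), targets)
    # Scale the loss so the accumulated .grad matches the gradient of the mean over the big batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```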

But as long as there is some nn.Parameter, the result will require grad, so I'm not sure why that is necessary?

The output of the layer does, yes. But for the first layer, the input does not. And it is a limitation of the checkpoint module that at least one input needs to require gradients if you want to be able to backprop through it.

By accumulating gradients I mean several forward and backward passes without calling zero_grad between them.

No, it should not make any difference.