You could use a smaller batch size and accumulate the gradients over a few iterations, then update the parameters using your optimizer.
Have a look at the 2nd option in this post.
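Roughly something like this (just a sketch, assuming a classification setup; `model`, `loader`, and `accumulation_steps` are placeholders for your own code):

```python
import torch
import torch.nn as nn

# Placeholder model, loss, and optimizer for illustration only
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accumulation_steps = 4  # effective batch size = loader batch size * 4
# Dummy loader with small batches of 8 samples
loader = [(torch.randn(8, 10), torch.randint(0, 2, (8,))) for _ in range(16)]

optimizer.zero_grad()
for i, (data, target) in enumerate(loader):
    output = model(data)
    # Scale the loss so the accumulated gradient matches the large-batch gradient
    loss = criterion(output, target) / accumulation_steps
    loss.backward()  # gradients accumulate in each parameter's .grad
    if (i + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```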
It would yield the same behavior regarding the gradients, but note that other layers like BatchNorm will behave differently, since they see smaller batches.
If that’s problematic, e.g. when your batch size is really small, you could change the BatchNorm momentum a bit or use other normalization layers, e.g. GroupNorm, which should be more stable with respect to small batch sizes.
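E.g. (again just a sketch; the 64 channels and 8 groups are arbitrary example values):

```python
import torch.nn as nn

# Option 1: lower the BatchNorm momentum so the running stats update more slowly
bn = nn.BatchNorm2d(64, momentum=0.01)  # default momentum is 0.1

# Option 2: swap in GroupNorm, which normalizes over channel groups
# instead of the batch dimension and thus doesn't depend on the batch size
gn = nn.GroupNorm(num_groups=8, num_channels=64)
```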