Does Batchnorm work for gradient accumulation?

A popular technique for reducing GPU memory requirements is gradient accumulation: you split a large batch into smaller micro-batches, run forward/backward on each, and only step the optimizer once the gradients have been summed. However, from what I could gather, it doesn't play well with batchnorm, because batchnorm computes its statistics over each micro-batch rather than over the full effective batch, so the result is not equivalent to actually training with the larger batch size.
See the discussion here:

My question is: are there any proven workarounds or fixes, used in production environments, that would let one combine gradient accumulation with batchnorm?
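For concreteness, this is the pattern I'm referring to (a minimal PyTorch sketch with a toy placeholder model and random data, just to show where the micro-batch statistics come in):

```python
import torch
from torch import nn

# Toy setup purely to illustrate the accumulation pattern (placeholder model/data).
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8),
                      nn.ReLU(), nn.Flatten(), nn.Linear(8 * 8 * 8, 10))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()
accum_steps = 4  # effective batch = accum_steps * micro-batch size

optimizer.zero_grad()
for step in range(8):
    x = torch.randn(4, 3, 8, 8)            # micro-batch of 4
    y = torch.randint(0, 10, (4,))
    # BatchNorm2d computes its mean/var over this micro-batch of 4,
    # not over the effective batch of 16 -- this is the mismatch.
    loss = loss_fn(model(x), y) / accum_steps
    loss.backward()                          # gradients sum across micro-batches
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```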

You can change batchnorm to RunningBatchNorm, which was proposed by Jeremy Howard in the fastai course. It removes the dependence on the batch size.
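For anyone curious, the core idea is to always normalize with smoothed running statistics instead of the current batch's statistics, so the output no longer depends on how the batch was split. A simplified sketch of that idea (not the exact fastai implementation, which also debiases its running sums):

```python
import torch
from torch import nn

class SimpleRunningBatchNorm(nn.Module):
    """Illustrative running-stats batchnorm; not fastai's exact RunningBatchNorm."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("running_mean", torch.zeros(num_features))
        self.register_buffer("running_var", torch.ones(num_features))

    def forward(self, x):                     # x: (N, C, H, W)
        if self.training:
            # Update the running statistics from the current (micro-)batch ...
            dims = (0, 2, 3)
            mean = x.mean(dim=dims)
            var = x.var(dim=dims, unbiased=False)
            with torch.no_grad():
                self.running_mean.lerp_(mean.detach(), self.momentum)
                self.running_var.lerp_(var.detach(), self.momentum)
        # ... but always normalize with the running values, so the result
        # does not depend on the size of the batch they were computed from.
        m = self.running_mean[None, :, None, None]
        v = self.running_var[None, :, None, None]
        x_hat = (x - m) / (v + self.eps).sqrt()
        return x_hat * self.weight[None, :, None, None] + self.bias[None, :, None, None]
```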


What about GroupNorm?

That is also an option. GroupNorm normalizes over groups of channels within each sample, so its statistics don't depend on the batch size at all; see the sketch below.
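A drop-in swap looks like this (a minimal sketch; the layer sizes here are just placeholders):

```python
import torch
from torch import nn

# BatchNorm2d normalizes across the batch dimension ...
bn_block = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32), nn.ReLU())

# ... whereas GroupNorm(num_groups, num_channels) normalizes over groups of
# channels within each sample, so the micro-batch size never enters its statistics.
gn_block = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.GroupNorm(8, 32), nn.ReLU())

x = torch.randn(2, 3, 16, 16)                 # tiny micro-batch
print(bn_block(x).shape, gn_block(x).shape)   # same output shape either way
```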