Does Batchnorm work for gradient accumulation?

A popular technique for reducing the memory needed to train with a large effective batch size is gradient accumulation. However, from what I could gather, it seems that it cannot be used reliably in real-world applications because it is incompatible with batchnorm: the normalisation statistics are computed per micro-batch, not per effective batch.
See the discussion here:

My question is: are there any recent, proven workarounds or fixes, used in production environments, that would let one use gradient accumulation?
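To make the setup concrete, here is a minimal PyTorch sketch of the kind of gradient accumulation I mean; the toy model, the shapes, and `accum_steps` are placeholders I picked for illustration:

```python
import torch
from torch import nn

# Toy model with a BatchNorm2d layer; everything here is illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.BatchNorm2d(8), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10))
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4   # effective batch = 4 x micro-batch
micro_batch = 8

optimizer.zero_grad()
for step in range(16):                     # stand-in for a DataLoader loop
    xb = torch.randn(micro_batch, 3, 32, 32)
    yb = torch.randint(0, 10, (micro_batch,))

    # BatchNorm2d normalises with the mean/var of these 8 samples, not of the
    # 32-sample effective batch, which is where the incompatibility comes from.
    loss = loss_fn(model(xb), yb) / accum_steps  # scale so grads average out
    loss.backward()                              # gradients accumulate in .grad

    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```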

You can replace batchnorm with RunningBatchNorm, which Jeremy Howard proposed in the fastai course. It normalises with running statistics rather than per-batch statistics, which removes the dependence on the batch size.
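Below is a minimal sketch of that idea only; it is simplified, not the actual course implementation (which also debiases its running sums), and `SimpleRunningBatchNorm` is just a name I picked:

```python
import torch
from torch import nn

class SimpleRunningBatchNorm(nn.Module):
    """Sketch: normalise with running statistics instead of batch statistics."""
    def __init__(self, num_features, momentum=0.1, eps=1e-5):
        super().__init__()
        self.momentum, self.eps = momentum, eps
        self.weight = nn.Parameter(torch.ones(num_features))
        self.bias = nn.Parameter(torch.zeros(num_features))
        self.register_buffer("run_mean", torch.zeros(num_features))
        self.register_buffer("run_var", torch.ones(num_features))

    def forward(self, x):                             # x: (N, C, H, W)
        if self.training:
            # Update the running statistics from the current micro-batch.
            with torch.no_grad():
                mean = x.mean(dim=(0, 2, 3))
                var = x.var(dim=(0, 2, 3), unbiased=False)
                self.run_mean.lerp_(mean, self.momentum)
                self.run_var.lerp_(var, self.momentum)
        # Always normalise with the smoothed running statistics, so a tiny
        # micro-batch contributes only a small fraction of the normalisation.
        m = self.run_mean[None, :, None, None]
        v = self.run_var[None, :, None, None]
        x_hat = (x - m) / (v + self.eps).sqrt()
        return x_hat * self.weight[None, :, None, None] + self.bias[None, :, None, None]
```

You could drop it in wherever you would otherwise use `nn.BatchNorm2d(num_features)`. Note that in this simplified version gradients do not flow through the normalisation statistics.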

What about GroupNorm?

That is also an option. GroupNorm computes its statistics per sample over groups of channels, so it does not depend on the batch size at all.
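Because GroupNorm's statistics never touch the batch dimension, you can swap it in for batchnorm and accumulate gradients freely. A rough sketch of that swap; `batchnorm_to_groupnorm` is a hypothetical helper I wrote for illustration, not a library function:

```python
from torch import nn

def batchnorm_to_groupnorm(module: nn.Module, num_groups: int = 8) -> nn.Module:
    """Recursively replace every BatchNorm2d with a GroupNorm over the same
    number of channels. GroupNorm normalises per sample, so the micro-batch
    size used for gradient accumulation no longer matters."""
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            # num_groups must divide num_channels; fall back to 1 group
            # (roughly LayerNorm over C, H, W) when it does not.
            groups = num_groups if child.num_features % num_groups == 0 else 1
            setattr(module, name, nn.GroupNorm(groups, child.num_features))
        else:
            batchnorm_to_groupnorm(child, num_groups)
    return module

# e.g. nn.BatchNorm2d(64)  ->  nn.GroupNorm(8, 64)
```

Note that after swapping you would need to train from scratch (or at least fine-tune), since the GroupNorm layers start with fresh parameters.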