A popular technique for reducing per-step memory requirements during training is gradient accumulation: a large effective batch is split into several micro-batches, gradients are summed across them, and the optimizer steps once at the end. However, from what I've gathered, it seems hard to use in real-world applications because it interacts badly with batch normalization — BatchNorm computes its mean/variance statistics over each micro-batch rather than over the full effective batch, so the result differs from genuine large-batch training.
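To make the concern concrete, here is a minimal NumPy sketch (a toy linear model with MSE loss; all names and sizes are placeholders of my own). It shows that accumulated micro-batch gradients reproduce the full-batch gradient exactly, while batch statistics of the kind BatchNorm uses do not compose across micro-batches:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))   # toy inputs
y = rng.normal(size=32)        # toy targets
w = rng.normal(size=3)         # toy weights

def grad_mse(xb, yb, w, n_total):
    """Gradient contribution of one micro-batch to the full-batch MSE loss."""
    return 2.0 * xb.T @ (xb @ w - yb) / n_total

# Full-batch gradient computed in one shot.
full_grad = grad_mse(X, y, w, len(y))

# The same gradient accumulated over 4 micro-batches of 8.
accum_grad = sum(grad_mse(X[i:i + 8], y[i:i + 8], w, len(y))
                 for i in range(0, 32, 8))
print(np.allclose(full_grad, accum_grad))  # True: accumulation is exact here

# BatchNorm-style variance is NOT additive across micro-batches:
full_var = X.var(axis=0)
micro_var = np.mean([X[i:i + 8].var(axis=0) for i in range(0, 32, 8)], axis=0)
print(np.allclose(full_var, micro_var))    # False: per-micro-batch stats differ
```

So for plain layers the technique is mathematically exact; the mismatch is specific to layers whose forward pass depends on batch-level statistics.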
See the discussion here:
My question is: are there any proven workarounds or fixes — ones actually used in production environments — that make gradient accumulation viable despite this?