A popular technique for reducing per-step memory requirements during training is gradient accumulation: a large effective batch is split into several micro-batches, gradients are summed across them, and the optimizer steps once at the end. However, from what I've gathered, it seems hard to use in real-world applications because it interacts badly with batch normalization — BatchNorm computes its mean/variance statistics over each micro-batch rather than over the full effective batch, so the result differs from genuine large-batch training.
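To make the concern concrete, here is a minimal NumPy sketch (a toy linear model with MSE loss; all names and sizes are placeholders of my own). It shows that accumulated micro-batch gradients reproduce the full-batch gradient exactly, while batch statistics of the kind BatchNorm uses do not compose across micro-batches:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 3))   # toy inputs
y = rng.normal(size=32)        # toy targets
w = rng.normal(size=3)         # toy weights

def grad_mse(xb, yb, w, n_total):
    """Gradient contribution of one micro-batch to the full-batch MSE loss."""
    return 2.0 * xb.T @ (xb @ w - yb) / n_total

# Full-batch gradient computed in one shot.
full_grad = grad_mse(X, y, w, len(y))

# The same gradient accumulated over 4 micro-batches of 8.
accum_grad = sum(grad_mse(X[i:i + 8], y[i:i + 8], w, len(y))
                 for i in range(0, 32, 8))
print(np.allclose(full_grad, accum_grad))  # True: accumulation is exact here

# BatchNorm-style variance is NOT additive across micro-batches:
full_var = X.var(axis=0)
micro_var = np.mean([X[i:i + 8].var(axis=0) for i in range(0, 32, 8)], axis=0)
print(np.allclose(full_var, micro_var))    # False: per-micro-batch stats differ
```

So for plain layers the technique is mathematically exact; the mismatch is specific to layers whose forward pass depends on batch-level statistics.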
See the discussion here:
My question is: are there any proven workarounds or fixes — ones actually used in production environments — that make gradient accumulation viable despite this?