How to implement accumulated gradient?

Your general approach is right, and I also assumed that BatchNorm layers might be a problem in this case.
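
For reference, here is a minimal sketch of the usual accumulation loop, where the loss is scaled by the number of accumulation steps so the summed gradients match a single larger batch. The model, data, and `accumulation_steps` value are just placeholders for illustration:

```python
import torch
import torch.nn as nn

# Toy model and data purely for illustration
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randint(0, 2, (64,))),
    batch_size=2,  # small per-step batch; gradients are accumulated across steps
)
accumulation_steps = 4  # effective batch size = 2 * 4 = 8

optimizer.zero_grad()
for step, (data, target) in enumerate(loader):
    output = model(data)
    # Scale the loss so the accumulated gradient averages over the effective batch
    loss = criterion(output, target) / accumulation_steps
    loss.backward()  # gradients are summed into .grad until zero_grad() is called
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

Note that this only accumulates gradients; the BatchNorm running statistics are still updated from each small forward pass, which is why the normalization layer itself can become the problem.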
If you only have a few samples in each forward pass, you could use InstanceNorm or GroupNorm instead, which should work better for small batch sizes.
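
The swap is usually a drop-in replacement at the layer level, since neither alternative depends on the batch dimension for its statistics. A small sketch (the channel and group counts are illustrative):

```python
import torch.nn as nn

# e.g. a BatchNorm2d over 64 channels ...
bn = nn.BatchNorm2d(64)

# ... could be replaced by GroupNorm (batch-size independent) ...
gn = nn.GroupNorm(num_groups=8, num_channels=64)

# ... or InstanceNorm, which normalizes each sample separately
inorm = nn.InstanceNorm2d(64)
```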
Alternatively, you could also try to lower the momentum of BatchNorm, but I’m not sure if that will really help a lot.
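
If you want to try it, the momentum is set at construction time; a smaller value makes the running statistics change more slowly across the noisy small-batch estimates. The value below is only an example:

```python
import torch.nn as nn

# Default momentum is 0.1; 0.01 smooths the running stats over more batches
bn = nn.BatchNorm2d(64, momentum=0.01)
```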
