How to implement accumulated gradient？

ptrblck · November 26, 2018, 12:15pm

Your general approach is right and I also assumed than BatchNorm layers might be a problem in this case.
If you just have very few samples in each forward pass, you could use InstanceNorm or GroupNorm instead, which should work better for small batch sizes.
Alternatively, you could also try to change the momentum of BatchNorm, but I’m not sure, if that will really help a lot.