Batch size memory problem and normalization across batches

I am a software engineer trying to learn PyTorch and deep learning.
I am trying to train a resnext50_32x4d model.
I have limited resources, so I cannot increase the batch size beyond 16.
Just to try it, I used a high-capacity cloud machine for a single day, increased the batch size to 128, and got much better results.
I think this is because the batch norm statistics are estimated more reliably from larger batches.
I am OK with slow training because of the limited resources, but I am not OK with the degraded normalization.
Is there a way to compute the batch normalization across 8 batches or something like that?

Thanks in advance!

Edit: I am not sure whether I posted according to the forum rules or which category to select. Please feel free to let me know if I did something wrong.

You could try changing the momentum of your batch norm layers, so that the running estimates are smoothed more.
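As a rough sketch of the momentum suggestion: PyTorch batch norm layers update their running estimates as `running = (1 - momentum) * running + momentum * batch_stat`, with a default `momentum` of 0.1, so a smaller value averages over more batches. The helper name `set_bn_momentum`, the value 0.01, and the small stand-in model are only assumptions for illustration; the same loop should work on a resnext50_32x4d.

```python
import torch.nn as nn

BN_TYPES = (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)

def set_bn_momentum(model: nn.Module, momentum: float) -> int:
    """Set `momentum` on every batch norm layer; return how many were changed."""
    changed = 0
    for module in model.modules():
        if isinstance(module, BN_TYPES):
            module.momentum = momentum
            changed += 1
    return changed

# Small stand-in model (hypothetical); replace with your own network.
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.BatchNorm2d(8),
    nn.ReLU(),
)

# 0.01 is an arbitrary starting point, not a recommended value.
num_changed = set_bn_momentum(model, momentum=0.01)
```

Note that this only affects the running statistics used in eval mode; in training mode each batch is still normalized with its own statistics.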
Besides that, you could try to trade compute for memory using torch.utils.checkpoint, which might let you fit a larger batch into memory.
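A minimal sketch of what checkpointing could look like, assuming a recent PyTorch version where `checkpoint` accepts `use_reentrant`. The class `CheckpointedNet` and its layers are made up for illustration and are not the ResNeXt architecture. Activations inside each checkpointed block are not stored during the forward pass and are recomputed during backward.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedNet(nn.Module):
    """Toy network (hypothetical) whose two conv blocks are checkpointed."""

    def __init__(self):
        super().__init__()
        self.block1 = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.block2 = nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Linear(16, 10)

    def forward(self, x):
        # Intermediate activations of each block are recomputed in backward
        # instead of being kept in memory.
        x = checkpoint(self.block1, x, use_reentrant=False)
        x = checkpoint(self.block2, x, use_reentrant=False)
        x = x.mean(dim=(2, 3))  # global average pooling
        return self.head(x)

model = CheckpointedNet()
images = torch.randn(4, 3, 32, 32)  # random stand-in batch
logits = model(images)
logits.sum().backward()
```

The price is an extra forward computation per checkpointed block, so training gets slower in exchange for the memory savings.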


Thanks @ptrblck, playing with the optimizer parameters solved the problem.