I would like to train a resnetxxx model on very large images (~3000x3000) in a domain where downscaling loses relevant information. My GPU only has enough memory for a single image at a time, so I looked into gradient accumulation, discussed e.g. here. This seems to work fine for the gradient itself, but is it possible to distribute the batch norm calculations over several (micro-)batches as well?
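For context, this is the gradient accumulation pattern I mean: scale each micro-batch loss by the accumulation factor and only step the optimizer every few backward passes. A minimal sketch (the tiny model, random data, and the factor of 4 are placeholders, not my actual setup):

```python
import torch
import torch.nn as nn

# Placeholder model/data; in practice this would be the resnet and real images.
model = nn.Sequential(nn.Flatten(), nn.Linear(16, 2))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

accum_steps = 4  # micro-batches per effective batch (example value)
data = [(torch.randn(1, 4, 4), torch.randint(0, 2, (1,))) for _ in range(8)]

optimizer.zero_grad()
for step, (x, y) in enumerate(data):
    loss = criterion(model(x), y) / accum_steps  # scale so grads average
    loss.backward()                              # grads accumulate in .grad
    if (step + 1) % accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
```

This makes the accumulated gradient match a larger batch, but each forward pass still sees only one image, which is exactly why the batch norm statistics remain a problem.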
No, I don’t think batchnorm layers have this functionality built in (in case you have multiple GPUs, you could try SyncBatchNorm). For a single GPU, you could try to change the momentum and see if you can smooth the running estimates.
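To illustrate both suggestions, here is a sketch: lowering the batchnorm momentum (0.01 is just an example value, the default is 0.1), and converting the model's batchnorm layers via `nn.SyncBatchNorm.convert_sync_batchnorm` (actually syncing stats would additionally require a distributed process group / DDP setup):

```python
import torch.nn as nn

# Toy model standing in for the resnet.
model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())

# Lower momentum -> running estimates change more slowly per (tiny) batch.
for m in model.modules():
    if isinstance(m, nn.BatchNorm2d):
        m.momentum = 0.01  # example value; default is 0.1

# Multi-GPU option: swap BatchNorm layers for SyncBatchNorm.
sync_model = nn.SyncBatchNorm.convert_sync_batchnorm(model)
```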
I will try to vary the normalization momentum.
Ideally, though, I would like to reduce the batch size to 1, and then the minibatch standard deviation cannot be meaningfully estimated from a single sample, so neither momentum adjustment nor syncing over GPUs helps. This is why I would like to estimate the normalization parameters (mu and sd) over ‘batches of minibatches’.
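One partial workaround I found: PyTorch batchnorm layers accept `momentum=None`, in which case the running mean/variance are updated as a cumulative average over all batches seen, which pools the running estimates across micro-batches. Note this only affects the running statistics used in eval mode; the normalization inside each training forward pass still uses the current micro-batch's own statistics. A minimal sketch:

```python
import torch
import torch.nn as nn

# momentum=None -> running stats are a cumulative average over all batches.
bn = nn.BatchNorm2d(4, momentum=None)
bn.train()
for _ in range(10):                  # ten micro-batches of size 1
    x = torch.randn(1, 4, 32, 32)
    bn(x)

# num_batches_tracked counts the batches averaged into the running stats.
print(int(bn.num_batches_tracked))  # prints 10
```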