I’m doing semantic segmentation (for building detection) using a library with a U-Net implementation. It has BatchNorm2d in most stages.
The BatchNorm2d layers get the following configuration:
BatchNorm2d(X, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
where X depends on the layer.
I get very different results for training and evaluation, and the evaluation results are somewhat erratic:
train: 100%|██████████| 4894/4894 [39:03<00:00, 2.09it/s, mseloss - 0.002174, iou_score - 0.8588]
valid: 100%|██████████| 572/572 [00:13<00:00, 41.39it/s, mseloss - 0.01254, iou_score - 0.3587]
train: 100%|██████████| 4894/4894 [39:03<00:00, 2.09it/s, mseloss - 0.002169, iou_score - 0.8582]
valid: 100%|██████████| 572/572 [00:13<00:00, 41.51it/s, mseloss - 0.01897, iou_score - 0.001748]
(Training uses 8 samples per batch, validation uses 1.)
Are there any recommendations for using BatchNorm2d to get similar behaviour in training and evaluation mode?
If I evaluate in training mode, I get better results, since the network sees an environment more similar to training. Why isn’t there a similar mode for evaluation?
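To be concrete, this is roughly what I mean by evaluating in training mode, restricted to just the BatchNorm layers (a sketch; `model` is a stand-in for my loaded U-Net):

import torch

model.eval()  # eval mode for everything else
# ...but force the BatchNorm layers back into training mode, so they
# normalize each batch with its own statistics instead of the running stats
for m in model.modules():
    if isinstance(m, torch.nn.BatchNorm2d):
        m.train()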
If I evaluate in training mode but under torch.no_grad(), do the running stats still get updated and used, and what happens to the affine parameters?
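I could check the running-stats part of this myself with something like the following (a sketch; `batch` is a stand-in for one of my validation batches):

import torch

model.train()
bn = next(m for m in model.modules() if isinstance(m, torch.nn.BatchNorm2d))
before = bn.running_mean.clone()

with torch.no_grad():   # no autograd graph, no gradients
    _ = model(batch)    # forward pass in training mode

# If this prints False, the running stats were updated despite no_grad():
# they are buffers written during forward, not parameters updated by autograd.
print(torch.equal(before, bn.running_mean))
# The affine weight/bias are parameters; without loss.backward() and an
# optimizer step they should stay unchanged.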
Could I set track_running_stats=False when using the model for evaluation, and use more samples per batch or larger images? (Is there an easy way to change that for a single layer after loading a model? I can iterate over the layers.)
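This is the kind of iteration I had in mind (a sketch; I’m not sure flipping the flag alone is enough, since the loaded checkpoint still contains the buffers, and as far as I understand eval mode only falls back to batch statistics when the running buffers are None):

import torch

for m in model.modules():
    if isinstance(m, torch.nn.BatchNorm2d):
        m.track_running_stats = False
        # Drop the loaded buffers so batch statistics are used in eval mode.
        m.running_mean = None
        m.running_var = None

model.eval()  # now each batch is normalized with its own statistics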
Could I warm up the batch normalization before evaluating, so the running stats settle before I look at the results?
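For the warm-up idea, I picture something like this (a sketch; `warmup_loader` is a stand-in for a loader yielding a few batches of validation-sized images):

import torch

# Reset the running statistics, then let them re-estimate over a few
# forward passes before switching to eval mode.
for m in model.modules():
    if isinstance(m, torch.nn.BatchNorm2d):
        m.reset_running_stats()

model.train()
with torch.no_grad():            # forward passes only, no weight updates
    for images, _ in warmup_loader:
        model(images)

model.eval()                     # evaluate with the refreshed running stats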
Should I try a smaller momentum, so that the running statistics aren’t too influenced by the last training iterations?
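This is my understanding of how momentum enters the running-stat update, which is what makes me ask (if I read the BatchNorm2d docs right, the convention is the opposite of the usual optimizer momentum):

def update_running_stat(running, batch_stat, momentum=0.1):
    # PyTorch's convention per the BatchNorm2d docs:
    # new running stat = (1 - momentum) * old + momentum * batch statistic,
    # so with momentum=0.1 each new batch contributes 10%, and a *smaller*
    # momentum makes the running stats move more slowly.
    return (1 - momentum) * running + momentum * batch_stat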
Thanks for looking at this!