Normalizing BatchNorm2d in train and eval mode?

Hi!

I’m doing semantic segmentation (for building detection) using a library with a UNet implementation. It has BatchNorm2d in most stages.

The layers get the following configuration:
BatchNorm2d(X, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
where X depends on the layer.

I get very different results for training and evaluation, and the evaluation is somewhat erratic:
Epoch: 97
train: 100%|██████████| 4894/4894 [39:03<00:00, 2.09it/s, mseloss - 0.002174, iou_score - 0.8588]
valid: 100%|██████████| 572/572 [00:13<00:00, 41.39it/s, mseloss - 0.01254, iou_score - 0.3587]

Epoch: 98
train: 100%|██████████| 4894/4894 [39:03<00:00, 2.09it/s, mseloss - 0.002169, iou_score - 0.8582]
valid: 100%|██████████| 572/572 [00:13<00:00, 41.51it/s, mseloss - 0.01897, iou_score - 0.001748]
(Train has 8 samples per batch, valid has 1)

Are there any recommendations for using BatchNorm2d to get similar behaviour in training and evaluation mode?

If I evaluate in training mode, I get better results and the network sees a more similar environment. Why isn’t there a similar mode for evaluation?

If I evaluate in training mode but with torch.no_grad(), do the running stats get updated and used, and what happens to the affine parameters?

Could I set track_running_stats=False when using the model for evaluation and use more samples per batch or larger images? (Is there an easy way to change that for a single layer after loading a model? I can iterate over the layers.)

Could I warm up the batch normalization for the evaluation before looking at the results?

Should I try a larger momentum so that the affine parameters aren’t influenced too much by the last training iterations?

Thanks for looking at this!

Best regards
Anders

A bad evaluation performance could point towards running stats that don’t represent the dataset well. This could have different reasons, such as different data domains for the training and validation sets. E.g., the batchnorm layers would most likely “break” if you forget to normalize the validation inputs while the normalization was applied during training.

You could disable the running stats during evaluation and use the batch statistics instead. However, this would also mean that your validation outputs and predictions depend on the batch size: you might get worse results for different batch sizes, or your model might even raise an error if the stats cannot be computed from a single sample. This is often not desired in the validation/test use case.
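If you want to try it anyway, a minimal sketch (assuming a standard nn.Module and a typical validation DataLoader; the helper name is just for illustration) could look like this:

```python
import torch
import torch.nn as nn

def validate_with_batch_stats(model, loader, device="cuda"):
    # Keep the whole model in eval mode, but switch only the BatchNorm2d
    # layers back to train mode so they normalize with the statistics of
    # the current batch instead of the running stats.
    model.eval()
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.train()  # note: this also keeps updating the running stats

    preds = []
    with torch.no_grad():
        for images, _ in loader:
            preds.append(model(images.to(device)))
    return preds
```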

Yes, the running stats will be updated, which could be considered a data leak. The affine parameters won’t be updated, since the forward pass was performed in the no_grad() context and you are most likely not calling loss.backward() and optimizer.step() to update the affine parameters.
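You can verify this quickly with a self-contained toy check (not your model, just a single layer):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm2d(3)                       # train mode by default
weight_before = bn.weight.detach().clone()   # affine scale (gamma)
mean_before = bn.running_mean.clone()        # running statistics buffer

with torch.no_grad():
    bn(torch.randn(8, 3, 16, 16))            # forward only, no backward()/step()

print(torch.equal(bn.running_mean, mean_before))  # False: stats were updated
print(torch.equal(bn.weight, weight_before))      # True: affine unchanged
```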

Yes, you could, but see the previous point for some disadvantages.
I don’t know if you can switch the behavior of batchnorm layers after they were initialized (I think this option wasn’t working recently, but you should double check and, in doubt, initialize these layers directly in the desired mode).
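If switching the flag in place doesn’t behave as expected, one workaround is to replace the layers after loading the model and copy the learned affine parameters over. A rough sketch (written against the current torch.nn API, not tested with your library’s UNet):

```python
import torch.nn as nn

def use_batch_stats(module):
    # Recursively replace every BatchNorm2d with a copy that was created
    # with track_running_stats=False, keeping the learned affine parameters.
    for name, child in module.named_children():
        if isinstance(child, nn.BatchNorm2d):
            new_bn = nn.BatchNorm2d(
                child.num_features,
                eps=child.eps,
                momentum=child.momentum,
                affine=child.affine,
                track_running_stats=False,
            )
            if child.affine:
                new_bn.weight.data.copy_(child.weight.data)
                new_bn.bias.data.copy_(child.bias.data)
            setattr(module, name, new_bn)
        else:
            use_batch_stats(child)
    return module
```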

I would also consider this a data leak. The validation samples might perform better, but you would have to ask yourself if you could even apply the same approach once this model is deployed. E.g., would you also use a warmup phase? If not, the validation results would be biased and your model could perform badly on real unseen samples.
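If you do decide to use such a warmup (and would apply the same procedure after deployment), a minimal sketch could look like this (hypothetical helper name, assuming a DataLoader over the warmup images):

```python
import torch

def warm_up_batchnorm(model, warmup_loader, device="cuda"):
    # Forward passes in train mode under no_grad(): only the BatchNorm
    # running stats are updated, the trainable parameters stay fixed.
    model.train()
    with torch.no_grad():
        for images, _ in warmup_loader:
            model(images.to(device))
    model.eval()  # evaluate afterwards with the adapted running stats
    return model
```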

Yes, changing the momentum might help in your use case.
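If the layers are created inside a library, you can still change the momentum on the already-constructed model, e.g. (0.01 is just an example value):

```python
import torch.nn as nn

def set_bn_momentum(model: nn.Module, momentum: float = 0.01) -> nn.Module:
    # In PyTorch, running_stat = (1 - momentum) * running_stat + momentum * batch_stat,
    # so a lower momentum makes the running stats average over more batches.
    for m in model.modules():
        if isinstance(m, nn.BatchNorm2d):
            m.momentum = momentum
    return model
```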

Thanks for the reply! You are awesome!

The running stats seem to handle the differences between areas quite well during training, so I think that could be a good thing for evaluation as well. The model will be used over larger areas and each area has similar properties, so running stats could be used per area in production. This network only has batch normalization layers that depend on the train/eval mode, so it could be used in training mode.
The settings for the layers are in a library, so I need to change the library or modify the layers afterwards.

Is there a drawback to using the running stats for data normalization in production (data leak)?
(Normalization is difficult and this normalization occurs deep in the network.)

I will also try to see if I can change the momentum and get a network that is more robust and doesn’t need the running stats.

Thanks again!

It might not be straightforward to guarantee the same batch size when this model is used in production. This could be an issue if the batch stats cannot be computed, e.g. from a single sample (the stddev would be zero). In any case, if your workflow allows for a constant batch size, you could certainly try this approach.

Hi again!

I looked in the manual and the momentum value should be lower (not higher as I wrote above) to get more stable statistics. I have lowered it and restarted training. Now it looks like evaluation mode is getting more stable, but I might need to try more values.