Dear all,
I am using PyTorch 1.3.1 to train a DeepLab-like ResNet-50 model with two dataloaders: one for the source domain and one for the target domain, both with batch_size=8.
Training on a V100-32G GPU, the model's performance (e.g., Dice score) on both domains keeps improving as long as the model stays in model.train() mode. But if I switch to model.eval() to evaluate on the source/target training or validation sets, something strange happens: the performance drops dramatically to 0. Likewise, if I test the trained model on a small batch (e.g., 4 or 8 images), model.train() mode gives normal performance while model.eval() gives 0.
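For context, my evaluation loop looks roughly like this (a minimal sketch; `model` and `loader` stand in for my actual network and dataloader, and `dice_score` is a simplified stand-in for my real metric):

```python
import torch

def dice_score(preds, masks, eps=1e-6):
    # Simplified binary Dice, just for illustration.
    preds = (torch.sigmoid(preds) > 0.5).float()
    inter = (preds * masks).sum()
    return (2 * inter + eps) / (preds.sum() + masks.sum() + eps)

@torch.no_grad()
def evaluate(model, loader, train_mode=False):
    # train_mode=True leaves BatchNorm on per-batch statistics;
    # train_mode=False (model.eval()) switches it to the running stats.
    model.train() if train_mode else model.eval()
    scores = [dice_score(model(x), y) for x, y in loader]
    return sum(scores) / len(scores)

# evaluate(model, loader, train_mode=True)   -> reasonable Dice
# evaluate(model, loader, train_mode=False)  -> Dice collapses to ~0
```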
After debugging, I traced the problem to the BatchNorm2d layers. If I keep the model in model.train() while evaluating on any training or validation set, everything is fine. But as soon as I call model.eval(), the problem reappears.
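If my understanding is right, the mechanism is that BatchNorm2d normalizes with the current batch's mean/variance in train mode but with the accumulated running_mean/running_var in eval mode, so when those running estimates do not match the data statistics, the eval-mode activations are badly mis-scaled. A small standalone illustration:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm2d(3)

# Feed data whose statistics differ from BN's initial running stats
# (running_mean=0, running_var=1), as happens early in training.
x = torch.randn(8, 3, 16, 16) * 5 + 10

bn.train()
out_train = bn(x)   # normalized with the batch's own mean/var
bn.eval()
out_eval = bn(x)    # normalized with running_mean/running_var

print(out_train.mean().item(), out_train.std().item())  # ~0, ~1
print(out_eval.mean().item(), out_eval.std().item())    # far from 0/1
print(bn.running_mean, bn.running_var)  # updated only by the train pass
```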
I have also found that other PyTorch users have run into this problem, e.g., in the thread "Performance highly degraded when eval() is activated in the test phase". However, the fixes suggested there, such as setting track_running_stats=False when creating the BN layers, not reusing BN layers, or using a larger batch size, did not solve it for me.
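For reference, the track_running_stats workaround looks roughly like this (a sketch of what I tried; it did not help in my case):

```python
import torch.nn as nn

# Constructing BN layers without running statistics makes eval()
# fall back to per-batch statistics instead of running_mean/running_var.
# Note this must be done when building the model, since the running
# buffers are registered at construction time.
def make_bn(num_features):
    return nn.BatchNorm2d(num_features, track_running_stats=False)
```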
If anyone else has run into this problem, I would be very grateful for any suggestions.
Best,
dong