Changed PyTorch model behavior after loading pre-trained weights

I am training a UNet from scratch on a mammography dataset to detect tumor tissue in mammograms. After training the model for 25 epochs, I got the following results on the TRAIN set:
Epoch 25, Loss: 0.0168, mean_IoU: 0.4271, mean_Dice: 0.3559, Elapsed time: 403.4554 sec

After saving the model's weights and evaluating the model on the TEST set, I got very bad results:
mean_Iou: 0.007, mean_Dice: 0.006

At first I thought the model was overfitting, but it performed just as badly on the TRAIN set:
mean_Iou: 0.006, mean_Dice: 0.008

After loading the weights, the model was supposed to reproduce the result from epoch 25 (am I correct?), but it gave very bad results instead. I double-checked the code that saves and loads the model, and nothing seems wrong.

Here is the link to implementation on GitHub: GitHub - hrnademi/Mammography
Here is the link to the reported log till epoch 25: Mammography/logs.txt at master · hrnademi/Mammography · GitHub

Further information about the dataset:
Training images: 2400 images with the size of 256x256 png
Test images: 600 images with the size of 256x256 png

In the pre-trained model is loaded.

Many thanks for your attention.

In the file you are loading the ‘Epoch_10_model.pth’ file. The logs at epoch 10 show:
Epoch 10, Loss: 0.0396, mean_Iou: 0.0, mean_Dice: 0.0911, Elapsed time: 403.3541 sec

Yes, a Dice of 0.008 is much lower than the 0.0911 logged during training, although the IoU is higher than during training. Hmm… I can't find anything else wrong.
Could you evaluate the model from the 25th epoch? For that I guess you need to save the model at every 5th iteration. Or at least try the model from the 20th epoch.
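A minimal sketch of saving a checkpoint every few epochs, so that the weights from a specific epoch (e.g. 20 or 25) can be evaluated later. It reads "every 5th iteration" as every 5th epoch, which is an assumption. The tiny `nn.Linear` model, the temporary directory and the empty training step are hypothetical placeholders; only the `Epoch_N_model.pth` naming pattern comes from the thread.

```python
import os
import tempfile

import torch
import torch.nn as nn

model = nn.Linear(4, 2)          # hypothetical stand-in for the UNet
num_epochs = 25
save_every = 5                   # assumption: checkpoint every 5th epoch
checkpoint_dir = tempfile.mkdtemp()
saved_paths = []

for epoch in range(1, num_epochs + 1):
    # ... one epoch of training would run here ...
    if epoch % save_every == 0:
        path = os.path.join(checkpoint_dir, f"Epoch_{epoch}_model.pth")
        torch.save(model.state_dict(), path)
        saved_paths.append(path)

# Later, in the evaluation script: load the checkpoint of the epoch to compare against.
eval_model = nn.Linear(4, 2)
eval_model.load_state_dict(torch.load(saved_paths[-1]))
eval_model.eval()
```

The metrics from the evaluation script can then be compared with the log line of the same epoch rather than with a different one.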

I removed the background from the mammography images and started training again.
Here you can find the results after 20 epochs. I just updated logs.txt too. Unfortunately, nothing has changed, results on the test set are very bad.

@ptrblck Could you help me with this problem?

If I understand the issue correctly, you are getting worse training and test results by loading the trained model.
Could you pass a constant input to the model (e.g. torch.ones) after training it (and before saving the state_dict) and save the output as a reference, and compare it to the result in your test script after loading the state_dict?
If these results are equal, then I guess the difference comes from the data loading and processing in both scripts.
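The check described above could look roughly like the following sketch. The small `make_model()` network is a hypothetical stand-in for the UNet, and an in-memory buffer replaces the `.pth` file. The model is put into eval mode on both sides, because batchnorm layers behave differently in training mode and would make the comparison meaningless.

```python
import io

import torch
import torch.nn as nn

torch.manual_seed(0)


def make_model():
    # Hypothetical stand-in for the UNet; any module with batchnorm works here.
    return nn.Sequential(
        nn.Conv2d(1, 4, kernel_size=3, padding=1),
        nn.BatchNorm2d(4),
        nn.ReLU(),
        nn.Conv2d(4, 1, kernel_size=1),
    )


model = make_model()

# Simulate a few training-mode forward passes so the running stats leave their defaults.
model.train()
with torch.no_grad():
    for _ in range(5):
        model(torch.randn(8, 1, 32, 32))

# "Training script": reference output for a constant input, computed in eval mode.
model.eval()
x = torch.ones(1, 1, 32, 32)
with torch.no_grad():
    reference = model(x)

# Save the state_dict (a buffer is used here instead of a file on disk).
buffer = io.BytesIO()
torch.save(model.state_dict(), buffer)
buffer.seek(0)

# "Test script": fresh model, load the weights, pass the same constant input.
restored = make_model()
restored.load_state_dict(torch.load(buffer))
restored.eval()
with torch.no_grad():
    output = restored(x)

outputs_match = torch.allclose(reference, output)
print("outputs match:", outputs_match)
```

If the outputs match, saving and loading are fine and the remaining suspects are the data loading and preprocessing in the two scripts.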

Unfortunately, I got different results.
I think something is wrong with batch normalization in the UNet.

After training the model and before saving the state_dict (model.train()):
tensor([[[[ -8.4269, -12.0579, -11.8041, …, -10.0886, -8.6157, -5.6419],
[ -5.6116, -10.0398, -8.1797, …, -6.5734, -5.6017, -8.4666],
[ -7.3116, -12.3692, -10.6413, …, -7.6334, -4.2797, -6.6685],
[ -3.6545, -5.0984, -5.1562, …, -3.8056, -4.3882, -9.5862],
[ -7.2274, -4.3889, -4.6838, …, -3.6184, -3.2181, -13.1806],
[ -7.8890, -4.6818, -6.3442, …, -5.1584, -4.0778, -6.4020]]]],
device='cuda:0', grad_fn=)

After saving the state_dict and setting the model to evaluation mode (model.eval()):

tensor([[[[-5.5660, -5.7427, -6.7510, …, -6.1302, -6.1012, -7.4343],
[-5.5984, -5.3072, -4.8963, …, -6.6118, -5.3268, -5.8644],
[-6.6022, -5.5419, -5.0204, …, -6.9750, -4.1107, -5.6634],
[-4.4239, -4.8906, -6.2160, …, -5.3613, -4.3033, -3.7815],
[-6.3954, -5.0958, -6.3577, …, -5.2069, -4.2653, -3.2605],
[-6.4229, -4.5915, -4.6084, …, -3.9113, -3.7404, -3.7629]]]],
device='cuda:0', grad_fn=)

It’s expected that the results differ if you pass the same input to the model in training and evaluation mode.
In the former case, the batchnorm layers will normalize the input activation using the batch statistics and will update the running stats with it, while the running stats will be used to normalize the activation in the latter case.
Could you compare the output in the training and evaluation script using model.eval() only?
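As a small illustration of the batchnorm behavior described above, the sketch below passes the same input through a single `nn.BatchNorm2d` layer in both modes. The layer size and the input scaling are arbitrary choices for the demonstration, not values from the UNet in question.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(3)
# Input whose statistics differ from the initial running stats (mean 0, var 1).
x = torch.randn(4, 3, 8, 8) * 5 + 2

bn.train()
out_train = bn(x)                      # normalized with the statistics of this batch
running_mean_after_train = bn.running_mean.clone()   # running stats were updated

bn.eval()
out_eval = bn(x)                       # normalized with the stored running stats
running_mean_after_eval = bn.running_mean.clone()    # eval mode leaves them untouched

modes_differ = not torch.allclose(out_train, out_eval)
stats_frozen_in_eval = torch.equal(running_mean_after_train, running_mean_after_eval)
print("train and eval outputs differ:", modes_differ)
print("running stats unchanged by eval forward:", stats_frozen_in_eval)
```

This is why a train-mode output cannot serve as a reference for an eval-mode output, even with identical weights and input.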

I ran the experiment again the way you suggested, and the outputs were equal.

The following curves were obtained after training the model for 50 epochs with batch_size=1 and track_running_stats=False set on every batch normalization layer in the UNet.
When I use the training set for both training and evaluating the model, I get good results, but when I use the test set for evaluation I get weird results.
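For context, here is a minimal sketch of what track_running_stats=False changes, using a single hypothetical `nn.BatchNorm2d` layer rather than the actual UNet. With this setting the layer keeps no running statistics, so it normalizes with the statistics of the current batch in eval mode as well. With batch_size=1, that means each image is normalized by its own per-channel statistics.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

bn = nn.BatchNorm2d(3, track_running_stats=False)
x = torch.randn(1, 3, 8, 8) * 5 + 2    # a single "image", i.e. batch_size=1

bn.train()
out_train = bn(x)

bn.eval()
out_eval = bn(x)

# No running buffers exist, so both modes fall back to batch statistics.
no_running_stats = bn.running_mean is None and bn.running_var is None
same_in_both_modes = torch.allclose(out_train, out_eval)

# Each channel of the single image is normalized to roughly zero mean.
channel_means = out_eval.mean(dim=(0, 2, 3))
print("no running stats:", no_running_stats)
print("train and eval outputs equal:", same_in_both_modes)
```

Under this setting a gap between training-set and test-set metrics cannot come from stale running statistics, which is consistent with looking at the data and the training routine next.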

Link to training log: Mammography/logs.txt at master · hrnademi/Mammography · GitHub

That’s good to hear and indicates that saving and loading the model works properly.

If you are seeing bad results on the validation dataset, it points towards the overall training routine, the dataset splits, etc., and not necessarily a software bug.