I tried to train a model with batchnorm layers. During training, I set model.train(). Every 100 iterations, I validate the accuracy after setting model.eval(). However, the validation results are not correct. I don’t think this is due to overfitting, because even when I test on the same images used for training, the testing loss is still quite different from the training loss. Also, if I keep model.train() during testing, the testing loss is correct. But such usage does not make sense, because my model contains batchnorm layers.
Below is my training code:
for epoch in range(num_epochs):
    epoch_loss = 0.0
    optimizer = lr_scheduler(optimizer, epoch)
    for iteration, data in enumerate(dataloader, 0):
        iter_index += 1
        label_patch = data['label_patch']
        residue_patch = data['residue_patch']
        stacked_patch = data['stacked_patch']
        microshift_patch = data['train_patch']
        inputs = Variable(stacked_patch.type(dtype))
        residues = Variable(residue_patch.type(dtype), requires_grad=False)
        labels = Variable(label_patch.type(dtype), requires_grad=False)
        microshifts = Variable(microshift_patch.type(dtype), requires_grad=False)
        # set model to train mode (before zero grad)
        model.train()
        # zero the parameter gradients
        optimizer.zero_grad()
        # forward
        outputs = model(inputs)
        loss = criterion(outputs, residues)
        # backward + optimize only if in training phase
        loss.backward()
        optimizer.step()
        # statistics
        epoch_loss += loss.data[0]
        # test the model every 100 iterations
        if iter_index % logging_iter == 0:
            loss_test, psnr_test = test_model(model)
    # checkpoint for each epoch
    model_out_path = "checkpoints/model_epoch_{}_residue.pth".format(epoch)
    torch.save(model, model_out_path)
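A minimal sketch of a test_model along these lines, with test_loader and the PSNR computation as placeholders rather than the exact code:

import math

def test_model(model):
    # switch batchnorm to its running statistics for evaluation
    model.eval()
    total_loss, total_psnr = 0.0, 0.0
    for i, data in enumerate(test_loader, 0):
        inputs = Variable(data['stacked_patch'].type(dtype), requires_grad=False)
        residues = Variable(data['residue_patch'].type(dtype), requires_grad=False)
        outputs = model(inputs)
        loss = criterion(outputs, residues)
        total_loss += loss.data[0]
        mse = ((outputs - residues) ** 2).data.mean()
        total_psnr += 10.0 * math.log10(1.0 / mse)
    model.train()  # back to training mode for the main loop
    return total_loss / len(test_loader), total_psnr / len(test_loader)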
Change the momentum term in the BatchNorm constructor to a higher value.
Before you set model.eval(), run a few inputs through the model (just the forward pass, you don’t need to backward). This will help stabilize the running_mean / running_std values.
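A rough sketch of the second suggestion, reusing the names from the training loop above (the number of warm-up batches is arbitrary):

# run a few forward passes in train mode so the running_mean / running_var
# buffers catch up with the current weights, then switch to eval
model.train()
for i, data in enumerate(dataloader, 0):
    if i >= 10:
        break
    warmup_inputs = Variable(data['stacked_patch'].type(dtype), requires_grad=False)
    model(warmup_inputs)  # forward only, no backward() / step()
model.eval()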
Thanks for your reply. I tried both but still get the error. I found that, with the same code, model.eval() is sometimes correct and sometimes not. I will investigate further and update if I find a solution.
Any update regarding this problem? I have already posted the same question, and it seems to me that many are facing the same issue. Could the PyTorch community react to this problem?
I’m trying to load Caffe weights into a PyTorch model with batchnorm layers. Each time I load the weights from the caffemodel file, the result for the same input is different, even in eval mode.
I’m actually updating the running_mean and running_var from the caffemodel weights, so there shouldn’t be any issue with bad running_means during inference.
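For reference, a conversion along these lines might look like the sketch below. Caffe’s BatchNorm layer stores mean*scale, var*scale and the scale factor in its three blobs (gamma and beta come from the separate Scale layer), and the argument names are placeholders for however the caffemodel is parsed:

import torch
import torch.nn as nn

def copy_caffe_bn(bn, bn_blobs, scale_blobs):
    # bn_blobs:    [mean * s, var * s, s] from the Caffe BatchNorm layer
    # scale_blobs: [gamma, beta] from the following Caffe Scale layer
    # (placeholder names; adapt to however the caffemodel is parsed)
    s = float(bn_blobs[2][0])
    s = 1.0 / s if s != 0 else 0.0
    bn.running_mean.copy_(torch.from_numpy(bn_blobs[0]).float() * s)
    bn.running_var.copy_(torch.from_numpy(bn_blobs[1]).float() * s)
    bn.weight.data.copy_(torch.from_numpy(scale_blobs[0]).float())
    bn.bias.data.copy_(torch.from_numpy(scale_blobs[1]).float())
    # Caffe and PyTorch both default eps to 1e-5, so bn.eps can usually stay as-is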
It’s not a problem in the sense that it’s not a software bug.
It’s a problem in the sense that if your training is non-stationary, you will see this behavior unless you adjust the momentum term of the BatchNorm. We set the momentum to 0.1 because it was sufficient for most of the workloads we use. Play around with it.
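For example, something like this raises the momentum of every BatchNorm layer in an existing model (0.5 is just a value to experiment with):

import torch.nn as nn

# raise the running-statistics momentum of every batchnorm layer;
# with momentum m, running_stat = (1 - m) * running_stat + m * batch_stat,
# so a larger m weights recent batches more heavily
for module in model.modules():
    if isinstance(module, (nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)):
        module.momentum = 0.5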
@smth Agreed that running_mean may mean different things in Caffe and PyTorch. However, in eval mode I believe only five things (running_mean, running_var, weight, bias, eps) should affect the final output.
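That is, in eval mode the layer should compute y = (x - running_mean) / sqrt(running_var + eps) * weight + bias, which can be checked against a manual computation (a small standalone sanity check with made-up shapes):

import torch
import torch.nn as nn
from torch.autograd import Variable

bn = nn.BatchNorm2d(8)
bn.eval()
x = Variable(torch.randn(4, 8, 16, 16))

mean = Variable(bn.running_mean.view(1, -1, 1, 1))
var = Variable(bn.running_var.view(1, -1, 1, 1))
w = bn.weight.view(1, -1, 1, 1)
b = bn.bias.view(1, -1, 1, 1)
manual = (x - mean) / (var + bn.eps).sqrt() * w + b

# difference should be on the order of floating-point error
print((bn(x) - manual).abs().max())

If the manual result matches bn(x), then any mismatch with the Caffe model has to come from one of those five values.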
I’ll try to see if I can come up with a minimal reproducible example for this.
@falmasri means that the statistics of the activations change rapidly during training, so the running_mean and running_std that BatchNorm accumulates with a momentum of 0.1 are no longer valid.
Nice, setting the momentum to 0.5 seems to make the loss computed with model.eval() similar to the loss computed with model.train(), although only after a few epochs of differing results.
Does the second suggestion only apply when we are loading a model for evaluation? It should not affect the loss calculation, correct or incorrect, if we are training the network and running validation every nth step.
Thanks for the reply! The training batch size is 6, not 1. I have also tried a larger batch size (32) with other architectures (upsampling on ResNet18), but the bug remains. My main question is that I don’t understand why PyTorch 0.1.12 works while >= 0.2 does not.