Model shows different predictions after training without weight update

Dear Community,

I encountered some non-intuitive behaviour. I load a model, set it to evaluation mode, and predict a single image using model(input). Then I set the model to training mode and predict the same image again. (Note that no .backward() is performed; I even disabled requires_grad for all parameters.) When I set the model to evaluation mode again and predict the same image as before, I get a different result.

TL;DR: when you evaluate an input before and after a training-mode forward pass in which no weight updates are performed, the prediction in evaluation mode changes.

Is this a bug or am I missing something here? You can run this code to check it yourself:

from torchvision import models
import torch

def set_parameter_requires_grad(model):
    for param in model.parameters():
        param.requires_grad = False

if __name__ == "__main__":
    model = models.densenet121(pretrained=True)
    set_parameter_requires_grad(model)
    model.train()
    input_ = torch.zeros((1,3, 224, 224))
    some_var = model(input_)
    model.eval()
    eval_value = model(input_)
    model.train()
    another_variable = model(input_)
    model.eval()
    eval_value_2 = model(input_)

    print(eval_value[0,0:3])
    print(eval_value_2[0,0:3])

output:

tensor([-0.3295,  0.2167, -0.6806])
tensor([-0.5839,  0.4981, -0.4104])

Edit: It’s not a dropout issue, the dropout_rate was 0 all along.

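densenet121 uses batchnorm layers, which update their running estimates (running_mean and running_var) in every forward pass while the model is in training mode. These estimates are buffers, not parameters, so setting requires_grad = False does not stop the updates.
During evaluation these running estimates are applied instead of the batch statistics, which explains the difference in your outputs.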
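The effect can be reproduced without downloading densenet121. The following is a minimal sketch, assuming a single nn.BatchNorm2d layer stands in for the full model; the input shape and the +5.0 offset are arbitrary choices made only so that the batch mean is clearly different from the initial running mean.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Stand-in for the batchnorm layers inside densenet121 (assumption for illustration).
bn = nn.BatchNorm2d(3)
for param in bn.parameters():
    param.requires_grad = False  # freezes weight and bias only, not the running stats

x = torch.randn(4, 3, 8, 8) + 5.0  # batch mean is far from the initial running_mean of 0

bn.eval()
before = bn(x)
mean_before = bn.running_mean.clone()

bn.train()
bn(x)  # forward pass only: running_mean and running_var are updated here
mean_after = bn.running_mean.clone()

bn.eval()
after = bn(x)

stats_changed = not torch.equal(mean_before, mean_after)
outputs_changed = not torch.allclose(before, after)
print("running stats changed:", stats_changed)
print("eval outputs changed:", outputs_changed)
```

If the running statistics should stay fixed, one common approach is to keep the batchnorm modules in evaluation mode (calling .eval() on them) while the rest of the model is in training mode.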


Thanks, ptrblck, for the explanation. I thought the batchnorm layers were disabled completely during evaluation.
Cheers

Hi! What exactly does the model do differently in model.train() vs. model.eval(), with and without gradient computation?

model.train() and model.eval() will switch the internal self.training flag, which would then change the behavior of some layers. E.g. dropout layers will be disabled during evaluation and batchnorm layers will use the running stats instead of the batch statistics to normalize the activation. The gradient computation will not be changed or disabled.
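A small sketch of this, assuming a toy nn.Sequential model (not one from this thread): the flag is flipped on every submodule, dropout becomes a no-op in evaluation mode, and the output still carries gradient information.

```python
import torch
from torch import nn

# Toy model used only for illustration.
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.ones(1, 4)

model.train()
flags_train = [m.training for m in model.modules()]  # container and both layers

model.eval()
flags_eval = [m.training for m in model.modules()]

# In eval mode dropout does nothing, so repeated forward passes agree.
out1 = model(x)
out2 = model(x)
same_in_eval = torch.equal(out1, out2)

# Switching modes does not touch autograd: the output is still part of a graph.
tracks_grad = out1.requires_grad

print(flags_train, flags_eval, same_in_eval, tracks_grad)
```

To actually skip gradient tracking during inference, the forward pass would have to be wrapped in torch.no_grad(), which is independent of train() and eval().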


Thanks for your answer! What about the combination of the two (both self.training options, with and without gradient computation)?

I am a little bit confused about the possibility of using backpropagation in evaluation mode.

Backpropagation and thus the gradient calculation will also work after calling model.eval() but as previously described the forward pass will be different. E.g. while the batchnorm layers will use the running_mean and running_var to normalize the data, the affine parameters (weight and bias) will still be trained and will get gradients.
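As a rough illustration, assuming a toy model with one nn.BatchNorm1d layer (an arbitrary choice for this sketch), a backward pass in evaluation mode populates the gradients of the affine parameters while leaving the running statistics untouched:

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model used only for illustration.
model = nn.Sequential(nn.Linear(4, 4), nn.BatchNorm1d(4))
model.eval()

bn = model[1]
running_mean_before = bn.running_mean.clone()

x = torch.randn(8, 4)
loss = model(x).sum()
loss.backward()  # works in eval mode as well

has_weight_grad = bn.weight.grad is not None
has_bias_grad = bn.bias.grad is not None
# The forward pass in eval mode did not update the running estimates.
stats_unchanged = torch.equal(running_mean_before, bn.running_mean)

print(has_weight_grad, has_bias_grad, stats_unchanged)
```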

I am using AlexNet for transfer learning (TL), extracting the features at different layers, e.g. Fc7:

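from torch import nn
from torchvision import models

layer = 'Fc7'
alexNet = models.alexnet(pretrained=True)
# Drop the last classifier layer so the model outputs the Fc7 features.
new_classifier = nn.Sequential(*list(alexNet.classifier.children())[:-1])
alexNet.classifier = new_classifier
#alexNet.eval()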

I noticed the following:

  • I only see the effect of activating/deactivating the gradient calculation at the output of this last layer (in model.train()). Is that because the forward pass has to reach the end of the network to compute the loss before backpropagation can run?

  • Using model.eval(), the results are the same with and without the gradient calculation activated. Why?

  1. I’m unsure which “effect of activating/deactivating the gradient calculation” you are referring to, so could you explain this effect a bit more? You don’t have to use the output of the last layer of a model to compute the gradients, since e.g. Inception models use an additional auxiliary loss.

  2. For the same input, the output of the model after calling model.eval() is expected to be the same (up to numerical precision) as long as no parameter updates were performed. Is this what you are seeing? If not, could you also explain the issue in more detail?

  1. With “effect of activating/deactivating the gradient calculation” I am referring to the output tensor of each layer.

  2. Yes! So even if the gradient calculation has not been deactivated, no weights will be modified in evaluation mode (after calling model.eval())?

  1. The gradient calculation does not influence the output tensors (of course up to numerical precision).

  2. No, parameters can still be updated after calling model.eval(), as .eval() will only change the behavior of some layers as described before.
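A minimal sketch of the second point, assuming a toy nn.Linear model and plain SGD (both chosen only for illustration): an optimizer step taken after model.eval() still changes the weights, and with them the output for the same input.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and optimizer used only for illustration.
model = nn.Linear(2, 1)
model.eval()  # only flips the training flag
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(4, 2)
weight_before = model.weight.detach().clone()
out_before = model(x).detach()

loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()  # parameters are updated even though the model is in eval mode

weight_changed = not torch.equal(weight_before, model.weight.detach())
output_changed = not torch.allclose(out_before, model(x).detach())
print(weight_changed, output_changed)
```

To prevent updates, the parameters would have to be frozen (requires_grad = False) or simply not passed to an optimizer; eval() alone does not do this.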

Could you explain to me why the gradient calculation does not influence the output tensors?

There is no reason why the gradient computation would influence the outputs as long as no parameter updates are performed, as seen here:

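import torch
from torch import nn

model = nn.Linear(1, 1)
x = torch.randn(1, 1)

for _ in range(10):
    out = model(x)
    print(out)  # same value in every iteration
    out.backward()  # compute gradients; no optimizer step, so no parameter update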

Thank you so much! Your answers have been very helpful