model.train(True) and model.train(False) give different results for the same input

Hi,

here the example

import numpy as np
import torch
from torchvision.models import resnet18  # this model has batchnorm

net = resnet18(pretrained=True)  # load the pretrained model
inputs = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = torch.autograd.Variable(torch.from_numpy(inputs), volatile=True)

# train=True
net.train(True)
Y1 = net(inputs)

# train=False
net.train(False)
Y2 = net(inputs)

# the two calls give different results
print(Y1[0, :2])
print(Y2[0, :2])

I know that model.train(mode=True/False) only affects the dropout and batchnorm layers. But in the example above, all the parameters are kept the same (even in the affected batchnorm layers).

Can anybody explain this issue?

Best

Even if the parameters are the same, it doesn’t mean the inference results are the same.

For dropout, when train(True), it applies dropout; when train(False), it doesn’t apply dropout (the output is identical to the input).

And for batchnorm, train(True) uses the batch mean and batch var, while train(False) uses the running mean and running var.
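
To make this concrete, here is a minimal sketch (written against a recent PyTorch API, so plain tensors rather than Variable) of a single BatchNorm layer producing different outputs for the same input depending on the mode:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 2.0 + 3.0   # batch with non-zero mean and non-unit std

bn.train()
y_train = bn(x)   # normalized with the batch mean/var; running stats get updated

bn.eval()
y_eval = bn(x)    # normalized with running_mean / running_var instead

print(torch.allclose(y_train, y_eval))   # False in general
print(bn.running_mean, bn.running_var)   # no longer the default 0 / 1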


I see. I guess it is different from the Caffe implementation. Thanks!

Are the mean and std computed at the batch level even if requires_grad is False?

For example, say batch norm layer N of a model has a running mean and std of 1.0 and 2.0 respectively, and the layer is frozen (requires_grad is False). If I feed the network an input batch with mean 0.3 and std 0.5, which values of mean and std will batch norm layer N use?

A layer doesn’t have requires_grad; only Variables do. running_mean and running_var are buffers and are updated during the forward pass. I assume train(True) will still use the batch mean and batch var.
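
A quick way to check this (again a sketch on a recent PyTorch version): setting requires_grad = False on the affine parameters does not stop the buffers from being updated in training mode.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
for p in bn.parameters():          # weight (gamma) and bias (beta)
    p.requires_grad = False

print(bn.running_mean.clone())     # zeros at initialization

bn.train()
x = torch.randn(16, 4) + 5.0
_ = bn(x)                          # forward pass in training mode
print(bn.running_mean)             # buffers moved towards the batch mean anyway

bn.eval()
_ = bn(x)                          # in eval mode the buffers stay fixed
print(bn.running_mean)             # unchanged by the eval-mode forward

So to freeze a batchnorm layer completely you need both: requires_grad = False on the affine parameters and layer.eval() for the statistics.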

I have a doubt: how can the inference be different when the parameters are the same? Since both dropout and batchnorm layers are affected, shouldn’t the parameters be different?

For dropout (which has no parameters at all), the positions that are dropped out change on every forward pass when train is True.
For BatchNorm, train(True) uses the batch statistics instead of running_mean and running_var, and running_mean and running_var are themselves updated during the forward pass.
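
For instance (a small sketch), the dropout mask is re-sampled on every training-mode forward pass, which is why identical inputs give different outputs:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # random positions zeroed, survivors scaled by 1 / (1 - p)
print(drop(x))   # a different random mask, hence a different output

drop.eval()
print(drop(x))   # identity: the input passes through unchanged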


This is very strange behaviour; I think it is a bug.
If the batch mean and batch std are used in training, and the running mean and running std are used in eval, then how is it possible to track convergence?

I am trying to train a classifier. In the last batches of the first epoch I start to see good results (on batches the network has not seen before).
However, when I move to evaluation mode, the results are terrible.
I think the proper way to handle train(True) for batch normalization is to apply the updated running_mean and running_std, not the current batch mean and std.
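
For what it’s worth, the usual way to track convergence is to switch modes explicitly around each phase, so that validation always uses the running statistics. A hypothetical skeleton (model, the loaders, criterion and optimizer are placeholders, not from this thread):

import torch

def run_epoch(model, train_loader, val_loader, criterion, optimizer, device):
    model.train()                          # batch stats used, running stats updated
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

    model.eval()                           # running stats used, dropout disabled
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
    return correct / total                 # validation accuracy with running stats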