model.train(True) and model.train(False) give different results for the same input

Hi,

here the example

import numpy as np
import torch
from torchvision.models import resnet18  # this model has batchnorm

net = resnet18(pretrained=True)  # load the pretrained model
inputs = np.random.randn(1, 3, 224, 224).astype(np.float32)
inputs = torch.autograd.Variable(torch.from_numpy(inputs), volatile=True)

# train=True
net.train(True)
Y1 = net(inputs)

# train=False
net.train(False)
Y2 = net(inputs)

# the two calls give different results
print(Y1[0, :2])
print(Y2[0, :2])

I know that model.train(mode=True/False) only affects the dropout and batchnorm layers. But in the example above, all the parameters are kept the same (even in the affected batchnorm layers).

Can anybody explain this issue?

Best

Even if the parameters are the same, it doesn’t mean the inference results are the same.

For dropout, when train(True), it applies dropout; when train(False), it doesn’t apply dropout (the output is identical to the input).

And for batchnorm, train(True) uses the batch mean and batch var, while train(False) uses the running mean and running var.
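
To make this concrete, here is a minimal sketch (written against a recent PyTorch API, so plain tensors rather than Variable) of a single BatchNorm layer producing different outputs for the same input depending on the mode:

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
x = torch.randn(8, 4) * 2.0 + 3.0   # batch with non-zero mean and non-unit std

bn.train()
y_train = bn(x)   # normalized with the batch mean/var; running stats get updated

bn.eval()
y_eval = bn(x)    # normalized with running_mean / running_var instead

print(torch.allclose(y_train, y_eval))   # False in general
print(bn.running_mean, bn.running_var)   # no longer the default 0 / 1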


I see. I guess it is different from the Caffe implementation. Thanks!

Are the mean and std computed at the batch level even if requires_grad is False?

For example, say batch norm layer N of a model has a running mean and std of 1.0 and 2.0 respectively, and the layer is frozen (requires_grad is False). If I feed the network an input batch with mean 0.3 and std 0.5, which values of mean and std will batch norm layer N use?

A layer doesn’t have requires_grad; only Variables do. running_mean and running_var are buffers and are updated during the forward pass. I assume train(True) will still use the batch mean and batch var.
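
A quick way to check this (again a sketch on a recent PyTorch version): setting requires_grad = False on the affine parameters does not stop the buffers from being updated in training mode.

import torch
import torch.nn as nn

bn = nn.BatchNorm1d(4)
for p in bn.parameters():          # weight (gamma) and bias (beta)
    p.requires_grad = False

print(bn.running_mean.clone())     # zeros at initialization

bn.train()
x = torch.randn(16, 4) + 5.0
_ = bn(x)                          # forward pass in training mode
print(bn.running_mean)             # buffers moved towards the batch mean anyway

bn.eval()
_ = bn(x)                          # in eval mode the buffers stay fixed
print(bn.running_mean)             # unchanged by the eval-mode forward

So to freeze a batchnorm layer completely you need both: requires_grad = False on the affine parameters and layer.eval() for the statistics.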

I have a doubt: how can the inference be different when the parameters are the same? Since both dropout and batchnorm layers are affected, shouldn’t the parameters be different?

For dropout (which has no parameters at all), the positions that are dropped out change on every forward pass when train is True.
For BatchNorm, train(True) uses the batch statistics instead of running_mean and running_var, and running_mean and running_var are themselves updated during the forward pass.
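
For instance (a small sketch), the dropout mask is re-sampled on every training-mode forward pass, which is why identical inputs give different outputs:

import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
print(drop(x))   # random positions zeroed, survivors scaled by 1 / (1 - p)
print(drop(x))   # a different random mask, hence a different output

drop.eval()
print(drop(x))   # identity: the input passes through unchanged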


This is very strange behaviour; I think it is a bug.
If the batch mean and batch std are used in training, and the running mean and running std are used in eval, then how is it possible to track convergence?

I am trying to train a classifier. In the last batches of the first epoch I start to see good results (on batches the network has not seen before).
However, when I move to evaluation mode, the results are terrible.
I think the proper way to handle train(True) for batch normalization is to apply the updated running_mean and running_std, not the current batch mean and std.
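
For what it’s worth, the usual way to track convergence is to switch modes explicitly around each phase, so that validation always uses the running statistics. A hypothetical skeleton (model, the loaders, criterion and optimizer are placeholders, not from this thread):

import torch

def run_epoch(model, train_loader, val_loader, criterion, optimizer, device):
    model.train()                          # batch stats used, running stats updated
    for inputs, targets in train_loader:
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()

    model.eval()                           # running stats used, dropout disabled
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, targets in val_loader:
            inputs, targets = inputs.to(device), targets.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == targets).sum().item()
            total += targets.size(0)
    return correct / total                 # validation accuracy with running stats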