If my model has dropout, do I have to alternate between model.eval() and model.train() during training?

So if a model has a dropout layer (or a batch-norm layer), then doing

model(x)

will/may yield a different result compared to

model.eval()
model(x)

But now, for training, what we have is an inference step and a training step. So do we actually have to do something like the following?

...
for epoch in range(epochs):
    model.eval()   # make sure we are in .eval() mode
    ...
    y_pred = model(x)
    ...
    model.train()  # make sure we are in .train() mode
    loss = loss_fun(y_pred, target)
    loss.backward()
    ...

What’s the correct way to train a network in this case?

Hi,

Regarding the first part of your question: yes, model(x) can give a different result after model.eval(). The reason is that model.eval() disables all dropout layers (and stops batch norm from updating its running mean/variance). Here is a small snippet to test:

import torch
import torch.nn as nn

class Test(nn.Module):
    def __init__(self):
        super(Test, self).__init__()
        self.layer = nn.Linear(1, 1, bias=False)   # a single weight
        self.dropout = nn.Dropout()                # p=0.5 by default

    def forward(self, x):
        out = self.dropout(self.layer(x))
        return out

model = Test()
x = torch.ones(1, 1)   # dummy input

model.train()          # dropout is active
for i in range(10):
    print(model(x))

If you run this code, roughly half of the iterations will print 0. The reason is that the model has a single neuron (self.layer) followed by dropout with probability 0.5, so the output tensor is zeroed with probability 0.5.
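For contrast, here is a minimal follow-up sketch (continuing the snippet above) showing that in eval mode dropout becomes a no-op, so every call returns the same non-zero value:

model.eval()                 # dropout is a no-op in eval mode
for i in range(10):
    print(model(x))          # the same, non-zero output on every iteration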

Regarding the "inference step" during training: that is not quite what happens in your example. The forward pass that produces y_pred is part of the training step itself; the only inference you run during training is evaluation on the validation set, which is still validation, not training. If you set model.eval() and then get predictions from your model, no dropout is applied and no batch-norm statistics are updated, so those layers effectively become no-ops. Dropout is a regularization technique that only influences which activations contribute to the weight update, so in eval mode it has no effect.
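To make that concrete, here is a minimal sketch of the usual pattern; the model, data, loss and optimizer below are placeholder choices, not anything from the original post. The model stays in train mode for the optimization steps and only switches to eval mode (together with torch.no_grad()) for validation:

import torch
import torch.nn as nn

# Placeholder model, data, loss and optimizer just to make the sketch self-contained
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Dropout(), nn.Linear(10, 1))
loss_fun = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
train_data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(5)]
val_data = [(torch.randn(4, 10), torch.randn(4, 1)) for _ in range(2)]

for epoch in range(3):
    model.train()                         # dropout active, batch norm (if any) updates its stats
    for x, target in train_data:
        optimizer.zero_grad()             # clear gradients from the previous iteration
        y_pred = model(x)                 # this forward pass is part of the training step
        loss = loss_fun(y_pred, target)
        loss.backward()
        optimizer.step()

    model.eval()                          # dropout off: deterministic validation outputs
    with torch.no_grad():                 # no autograd graph needed for validation
        val_loss = sum(loss_fun(model(x), t) for x, t in val_data) / len(val_data)
        print(f"epoch {epoch}: val_loss {val_loss.item():.4f}")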

Bests
Nik

Thank you for the response. So basically: use model.train() before training, and model.eval() after loading a trained model for inference, and dropout will have no side effects.

Yes, exactly. Say you are using a pretrained model. If you do not set model.eval(), some of the values in the output tensor will be zeroed out, which is not what you want: the model is already trained and we want stable values, not random zeros that land in different places on every run.

Remember, though, that model.eval() only affects dropout and batch norm, as we discussed. You still need to wrap the inference loop in torch.no_grad(); otherwise, even with model.eval() set, gradients will still be computed and tracked.
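Putting the two together, a minimal inference sketch, reusing the toy Test model from the snippet above as a stand-in for a pretrained network:

model = Test()          # stand-in for a network whose trained weights have already been loaded
model.eval()            # dropout becomes a no-op, batch norm would use its running statistics

with torch.no_grad():   # autograd builds no graph here, saving memory and compute
    for i in range(3):
        print(model(torch.ones(1, 1)))   # stable, non-zero output on every call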

I see. But when would this be necessary? During training, it is customary to zero out the gradients with something like optimizer.zero_grad() before backpropagation, so that is not an issue there.

For pure inference with a pretrained model, do the gradients matter in any way? I guess keeping track of them might introduce a slight overhead.

Dropout does not zero gradients; it zeros activations in the forward pass. Zeroing the gradients at an arbitrary point is not a good idea, because they have been accumulated over the backward passes run so far, and discarding that accumulation can weaken the updates. optimizer.zero_grad() is the call that zeros the gradients, and it is meant to be used once per iteration, around each forward and backward pass.
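A small self-contained sketch (a hypothetical one-weight linear layer, not from the posts above) showing that backward() accumulates into .grad and that optimizer.zero_grad() is what clears it:

import torch

layer = torch.nn.Linear(1, 1, bias=False)
opt = torch.optim.SGD(layer.parameters(), lr=0.1)
x = torch.ones(1, 1)

layer(x).backward()          # first backward: d(out)/d(w) = x = 1
print(layer.weight.grad)     # tensor([[1.]])

layer(x).backward()          # no zero_grad in between, so gradients accumulate
print(layer.weight.grad)     # tensor([[2.]])

opt.zero_grad()              # this is what clears the accumulated gradient
print(layer.weight.grad)     # None or zeros, depending on the PyTorch version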

If you do not use torch.no_grad(), the gradients in the pretrained model will be updated and the result will be overfitted on your test set.

I’m not sure I completely follow. If I am using a pretrained model and make predictions without torch.no_grad(), then even if the gradients are computed, no weights are actually being updated. There is no overfitting happening, because no loss is defined during pure inference and no gradient descent is performed.
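A quick sketch to check this, reusing the toy Test module from earlier in the thread: without torch.no_grad() the output carries a grad_fn (a graph is built, costing some memory), but the weights themselves never change unless backward() and an optimizer step are called:

model = Test()                          # the toy module from earlier in the thread
model.eval()
w_before = model.layer.weight.detach().clone()

y = model(torch.ones(1, 1))             # no torch.no_grad(): a graph is built
print(y.grad_fn is not None)            # True -> autograd tracked the forward pass

with torch.no_grad():
    y = model(torch.ones(1, 1))         # no graph is built
print(y.grad_fn is None)                # True

# Either way, the weights are untouched: nothing calls backward() or optimizer.step()
print(torch.equal(w_before, model.layer.weight))   # True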

Oh, my bad, I was thinking of the training code, since your original post was about swapping between model.eval() and model.train(). But using torch.no_grad() will still save a lot of time and memory!

@Nikronic Is it important to use model.eval() in the training loop?