'model.eval()' vs 'with torch.no_grad()'


It’s not that backprop is “necessary” when calling .eval(). It’s just that .eval() has nothing to do with the autograd engine or backprop at all.


Why is the model forward pass slow while using torch.no_grad()?


I don’t see any mention of speed in this blog post.
Can you detail your question a bit more please?

Hi @ptrblck, is it required to re-enable gradients with torch.set_grad_enabled(True) after a torch.no_grad() block when switching back from model.eval() to model.train(), or will gradients be enabled automatically by model.train()? I just want to confirm; I assume they are enabled automatically.

model.train() and model.eval() do not change any behavior of the gradient calculations, but are used to set specific layers like dropout and batchnorm to evaluation mode (dropout won’t drop activations, batchnorm will use running estimates instead of batch statistics).
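A minimal sketch of that point: switching to eval mode changes the dropout behavior, but the output still carries a graph and backprop still works (the model and input names here are made up for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 4), nn.Dropout(p=0.5))
x = torch.randn(2, 4)

# In train mode, dropout randomly zeroes activations.
model.train()
out_train = model(x)

# In eval mode, dropout is a no-op, but autograd is untouched.
model.eval()
out_eval = model(x)
print(out_eval.requires_grad)  # True: eval() does not disable gradients
out_eval.sum().backward()      # backprop still works after eval()
```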

After the with torch.no_grad() block was executed, your gradient behavior will be the same as before entering the block.
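To make that concrete, here is a small sketch showing the grad mode being disabled inside the block and restored afterwards:

```python
import torch

x = torch.ones(3, requires_grad=True)

print(torch.is_grad_enabled())      # True before the block
with torch.no_grad():
    y = x * 2                       # no graph is built here
    print(torch.is_grad_enabled())  # False inside the block
print(torch.is_grad_enabled())      # True again: previous state is restored
print(y.requires_grad)              # False: y was created under no_grad
```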


Thanks for your explanation.
I am actually more interested in the usage of model.eval() and torch.no_grad()…

So that means during evaluation, it’s enough to use:

for batch in val_loader:
    #some code

or I need to use them as:

with torch.no_grad():
    for batch in val_loader:
        #some code



The first approach is enough to get valid results.
The second approach will additionally save some memory.
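Putting both pieces together, a typical validation loop looks like the sketch below. The model, criterion, and loader here are toy stand-ins for whatever your training script defines:

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

# Toy setup; in practice model, criterion, and val_loader come from your script.
model = nn.Linear(10, 2)
criterion = nn.CrossEntropyLoss()
val_loader = DataLoader(
    TensorDataset(torch.randn(32, 10), torch.randint(0, 2, (32,))),
    batch_size=8,
)

model.eval()                   # dropout/batchnorm switch to eval behavior
val_loss = 0.0
with torch.no_grad():          # no graph is stored -> lower memory usage
    for data, target in val_loader:
        output = model(data)
        val_loss += criterion(output, target).item()
print(val_loss / len(val_loader))
```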


Thanks! That helps a lot. :+1::+1:


If I’m not using loss.backward() in my eval loop, do I still need to set torch.no_grad()? Will it make any difference?


You don’t need to, but you can save memory and thus potentially increase the batch size, as no intermediate tensors will be stored. :wink:
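You can see that memory effect directly: under no_grad the output has no grad_fn, i.e. no backward graph (and thus no saved intermediates) was created. A small sketch:

```python
import torch
import torch.nn as nn

model = nn.Linear(8, 8)
x = torch.randn(4, 8)

out = model(x)
print(out.grad_fn is None)  # False: graph and intermediates are kept

with torch.no_grad():
    out = model(x)
print(out.grad_fn is None)  # True: no graph, so nothing extra is stored
```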


@albanD @ptrblck Thank you for your clear explanations! I have learned a lot from this question.

I tried both and I noticed that with torch.no_grad() in my evaluation loop it took ~5 sec more per epoch. I don’t get why.

By “both” do you mean model.eval() and torch.no_grad()?
If so, these calls are independent as explained in this previous post.

What is your baseline, that runs faster? Is it the training loop?
And how are you profiling the code? Note that CUDA operations are asynchronous, so that you would have to synchronize before starting and stopping the timer.
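A sketch of such a synchronized timer (the helper name `timed` is made up; it falls back to plain CPU timing when CUDA is unavailable):

```python
import time
import torch

def timed(fn, device):
    """Time fn(), synchronizing first and after so pending CUDA kernels are counted."""
    if device.type == "cuda":
        torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = fn()
    if device.type == "cuda":
        torch.cuda.synchronize()  # wait for all queued kernels to finish
    return out, time.perf_counter() - t0

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = torch.nn.Linear(256, 256).to(device)
x = torch.randn(64, 256, device=device)
_, elapsed = timed(lambda: model(x), device)
print(f"{elapsed * 1e3:.3f} ms")
```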

I’m only speaking about my eval (validation) loop, comparing it with and without torch.no_grad(). With torch.no_grad() it’s a little bit slower (around 5 sec). I always call model.eval() before entering the loop. I have a class named RunManager(), which does exactly what the name implies: among many things, it keeps track of start/end times for each run (each set of parameters, like batch size, learning rate, etc.) and for each epoch. All my code runs on CUDA.

That’s weird.
Could you post a code snippet to reproduce this issue?

I had a mistake in my code: model.train() was on the wrong line. Indeed, for the same run:

  • with torch.no_grad(): ~55 sec

  • without torch.no_grad(): ~57 sec

If we want to select the best model by minimum validation loss, why do we need to set eval mode to compute the validation loss? Why not just:
with torch.no_grad():

Why don’t we want to consider dropout (DP) or batchnorm (BN) when computing the validation loss?

If I only use model.eval(), does it compromise the final performance (ignoring speed for now)?

Hi Jangang,
the dropout and batch normalization layers behave differently in the train and eval (test) procedures. Specifically, dropout is a stochastic layer with e.g. a drop rate of p in train mode, and deterministic in eval mode. Thus when doing evaluation (dev/test), {with torch.no_grad(): model.train()} is not equivalent to {with torch.no_grad(): model.eval()}.
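That train/eval difference can be sketched with a bare dropout layer: in train mode two forward passes usually disagree, while in eval mode they always match:

```python
import torch
import torch.nn as nn

drop = nn.Dropout(p=0.5)
x = torch.ones(1, 8)

drop.train()
a, b = drop(x), drop(x)
print(torch.equal(a, b))  # usually False: dropout is stochastic in train mode

drop.eval()
a, b = drop(x), drop(x)
print(torch.equal(a, b))  # True: dropout is a deterministic no-op in eval mode
```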

Hi Yongkai,
no, it doesn’t, according to the discussions above. After calling model.eval(), layers like dropout or batch normalization become deterministic, so the final performance is not compromised; skipping torch.no_grad() only costs extra memory.