Why don't we use torch.no_grad() when getting validation performance during training?

I’ve seen posts (such as this one) where people discuss using no_grad() purely for the purpose of saving memory during inference.

However, I was wondering if it makes sense to use it during training. In most setups, including mine, a researcher will train their model for, say, 100 epochs, and every 10 epochs they'll do a forward pass on validation data to report loss and/or accuracy. From my understanding, every time you do a forward pass with a torch model (and you are not in a torch.no_grad() block), gradients will accumulate for the weight tensors of the model, so that when you go to do the backward pass, the gradients are already there and just need to be multiplied.

If this is the case, then doesn’t this mean it is a good idea to use torch.no_grad() whenever we test our validation loss/accuracy? Otherwise, won’t the model’s gradients be impacted by validation data, meaning validation data impacts training?

I’m guessing I’m misunderstanding something about how torch.no_grad() works / how gradients are really computed / stored.

Any clarification would be great!

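That's not quite right. The forward pass does not accumulate any gradients. It only stores the intermediate activations, which are needed to compute the gradients in the backward pass. The backward pass then accumulates the gradients into the .grad attribute of each parameter that was used. So a validation forward pass without a backward() call does not change the gradients used for training.
Inside a with torch.no_grad() block the intermediate activations are not stored, which saves memory.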
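
To make this concrete, here is a minimal sketch you could run yourself. The toy nn.Linear model and the random input are assumptions for illustration only, not anything from your setup. It checks that .grad stays empty after a forward pass, is filled only by backward(), and that no graph is recorded under no_grad():

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 2)   # hypothetical toy model
x = torch.randn(8, 4)     # hypothetical input batch

# A forward pass alone does not touch the parameters' .grad fields.
out = model(x)
grads_after_forward = [p.grad for p in model.parameters()]  # still None

# The output is attached to a graph that holds what backward() needs.
has_graph = out.grad_fn is not None

# Only backward() writes (accumulates) into .grad.
out.sum().backward()
grads_after_backward = [p.grad is not None for p in model.parameters()]

# Under no_grad() no graph is recorded, so nothing is kept for backward().
with torch.no_grad():
    out_ng = model(x)
no_graph = out_ng.grad_fn is None and not out_ng.requires_grad
```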

And yes, you are right that no_grad() can be, and is, used during the validation run as well (not only when deploying the model), as seen e.g. in the ImageNet example. There you can see that the model is set to eval() first and the validation loop is executed in the no_grad context.
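
A validation helper along those lines might look roughly like the sketch below. This is not the code from the ImageNet example. The validate function, the toy model, and the list standing in for a DataLoader are all hypothetical names made up for illustration:

```python
import torch
import torch.nn as nn

def validate(model, loader, criterion):
    """Return (mean loss, accuracy) over loader without recording graphs."""
    was_training = model.training
    model.eval()                      # switch dropout/batchnorm to eval behavior
    total_loss, correct, count = 0.0, 0, 0
    with torch.no_grad():             # do not store intermediate activations
        for inputs, targets in loader:
            outputs = model(inputs)
            # criterion returns the batch mean, so weight it by the batch size
            total_loss += criterion(outputs, targets).item() * targets.size(0)
            correct += (outputs.argmax(dim=1) == targets).sum().item()
            count += targets.size(0)
    model.train(was_training)         # restore the previous mode
    return total_loss / count, correct / count

torch.manual_seed(0)
# Hypothetical classifier and a list of (inputs, targets) batches as the loader.
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Dropout(0.5), nn.Linear(8, 3))
loader = [(torch.randn(5, 4), torch.randint(0, 3, (5,))) for _ in range(2)]
val_loss, val_acc = validate(model, loader, nn.CrossEntropyLoss())
```

Note that eval() and no_grad() do different jobs: eval() changes the behavior of layers such as dropout and batchnorm, while no_grad() only disables graph recording.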

I see, thank you!

So will the intermediate activations be accumulated or replaced?

In other words, if I did:


And then I proceeded to do a backwards pass and update weights, would the activations from the second forward pass only be used to compute gradients?


The activations are stored in the computation graph of the forward pass that created them, so they won't be accumulated or replaced.
That is, if train_data is a new input tensor that is not attached to any computation graph (e.g. from the previous forward pass), and if you compute the loss based on the output of the second pass only, the gradients will be computed from the intermediate activations of that second pass only.
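
You can check this with a small experiment. The sketch below is not your original snippet. The toy model and the random val_data and train_data tensors are assumptions for illustration. It compares the gradients from "validation forward, then training forward and backward" against "training forward and backward only":

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(4, 1)          # hypothetical toy model
val_data = torch.randn(6, 4)     # hypothetical validation batch
train_data = torch.randn(6, 4)   # hypothetical training batch

# Case A: validation forward (graph recorded), then training forward + backward.
val_out = model(val_data)            # builds its own, separate graph
model(train_data).sum().backward()   # backward() only walks the second graph
grad_a = [p.grad.clone() for p in model.parameters()]

# Case B: training forward + backward only.
model.zero_grad()
model(train_data).sum().backward()
grad_b = [p.grad.clone() for p in model.parameters()]

# The validation pass left no trace in the gradients.
same = all(torch.allclose(a, b) for a, b in zip(grad_a, grad_b))
```

The validation pass in case A still costs memory for as long as val_out is alive, which is the part that no_grad() avoids.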

However, if you reuse a computation graph, Autograd will backpropagate through the entire graph and use all of its intermediate activations. This would be the case for e.g. a recursive model where the output is fed back in as the new input in a loop.
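
As a rough illustration of that last case, the sketch below feeds the output of a layer back in as its input three times. The cell layer and the number of steps are hypothetical. It then repeats the loop with detach() before the final step to cut the graph, so you can compare the two gradients:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
cell = nn.Linear(3, 3)    # hypothetical layer applied repeatedly
x0 = torch.randn(2, 3)    # hypothetical initial input

# Output fed back as input: every step extends the same graph,
# so backward() flows through all three steps.
h = x0
for _ in range(3):
    h = torch.tanh(cell(h))
h.sum().backward()
grad_full = cell.weight.grad.clone()

# Same loop, but detach before the last step: the graph is cut there,
# so backward() only sees the final step's activations.
cell.zero_grad()
h = x0
for step in range(3):
    if step == 2:
        h = h.detach()
    h = torch.tanh(cell(h))
h.sum().backward()
grad_last = cell.weight.grad.clone()

# The two gradients should differ, since the first includes earlier steps.
differs = not torch.allclose(grad_full, grad_last)
```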