Regarding loss.backward() inside "with torch.no_grad()"

In training mode, I call a function that calculates the loss and calls “loss.backward()”.
In validation, if I call the same function inside “with torch.no_grad()”, does it still perform “loss.backward()” or not?

Hi,

No. You don’t have to call loss.backward() during validation, since the purpose of validation is to assess the model on data it has not seen during training. (Validation data does not remain strictly unseen when you validate over different folds, but you can ignore that for now to avoid confusion; those subtleties can be discussed later.)

loss.backward() is called to compute the gradient of the loss w.r.t. each trainable parameter. During validation you don’t want to update the model based on the validation loss, because that would defeat the purpose of validation, which is to test the model on a held-out portion (say 10%) of the training data after training it on the other portion (i.e. 90%). So there is no reason to compute those gradients during validation.

Side note: “with torch.no_grad()” just runs the assessment without keeping track of gradients, because during assessment (i.e. validation) you don’t want to deal with gradients or any kind of update to the trainable parameters anyway.
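To make the split concrete, here is a minimal sketch of one way to share the loss computation between training and validation. It assumes you move loss.backward() out of the shared function; the helper name compute_loss, the MSE loss, and the random tensors are hypothetical stand-ins, not something from the original question.

```python
import torch
import torch.nn as nn

def compute_loss(model, loss_fn, x, y):
    # Hypothetical shared helper: forward pass and loss only, no backward()
    return loss_fn(model(x), y)

model = nn.Linear(100, 20)
loss_fn = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

# Stand-in data: 180 training samples, 20 validation samples
x_train, y_train = torch.randn(180, 100), torch.randn(180, 20)
x_val, y_val = torch.randn(20, 100), torch.randn(20, 20)

# Training step: gradients are tracked, so backward() and step() are called
optimizer.zero_grad()
train_loss = compute_loss(model, loss_fn, x_train, y_train)
train_loss.backward()
optimizer.step()

# Validation step: no graph is recorded and backward() is never called
with torch.no_grad():
    val_loss = compute_loss(model, loss_fn, x_val, y_val)

print(train_loss.requires_grad, val_loss.requires_grad)
```

With this layout the validation call cannot trigger an update by accident, because the loss it returns carries no autograd graph.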

You mean to say we should use “torch.no_grad()” so that it does not calculate gradients, and at the same time we should not call “loss.backward()” in validation. If we do call “loss.backward()” in validation, then I think it backpropagates the stored gradient. Am I right?

Yes.

torch.no_grad() makes computation faster as we no longer keep track of gradients. I think (and I also checked with a short piece of code) that loss.backward() inside torch.no_grad() still calculates gradients, which the optimizer can then apply, as long as the loss itself was computed outside the block. So be careful with using loss.backward().

For example:

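import torch
import torch.nn as nn

a = torch.randn(200, 100)  # inputs
b = torch.randn(200, 20)   # targets
layer = nn.Linear(100, 20)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)
print(layer.weight.data)

# Forward pass outside no_grad: out is part of the autograd graph
out = layer(a) - b

with torch.no_grad():
    optimizer.zero_grad()
    # Computed inside no_grad: loss has requires_grad=False and no grad_fn
    loss = torch.sum(out)
    loss.backward()  # raises RuntimeError, so the lines below never run
    optimizer.step()

    print(layer.weight.grad)
    print(layer.weight.data)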

throws an error saying that the loss does not require grad and has no grad_fn (gradient function). This is because the result of any operation computed inside the torch.no_grad() block has requires_grad = False, even when its inputs require gradients. Tensors created before the block keep their own requires_grad setting. (See this.)
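A small sketch to see this directly, assuming a recent PyTorch version. It checks the requires_grad flags on both sides of the block and catches the error instead of letting it stop the script; the variable backward_failed is only there for illustration.

```python
import torch
import torch.nn as nn

layer = nn.Linear(100, 20)
a = torch.randn(200, 100)

# Computed with gradient tracking enabled: part of the autograd graph
out = layer(a)

with torch.no_grad():
    # Computed inside the block: detached from the graph
    loss = torch.sum(out)

print(out.requires_grad)           # flag of the tensor made before the block
print(loss.requires_grad)          # flag of the tensor made inside the block
print(layer.weight.requires_grad)  # parameters are not modified by no_grad

try:
    loss.backward()
    backward_failed = False
except RuntimeError as err:
    backward_failed = True
    print("backward() failed:", err)
```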

Now, if you write the same code as in the failing example but with the loss calculated outside torch.no_grad(), autograd will calculate the gradient of the loss w.r.t. each of the trainable parameters.

import torch
import torch.nn as nn

a = torch.randn(200, 100)  # inputs
b = torch.randn(200, 20)   # targets
layer = nn.Linear(100, 20)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)
print(layer.weight.data)

# Forward pass and loss both outside no_grad: the full graph is recorded
out = layer(a) - b
loss = torch.sum(out)

with torch.no_grad():
    optimizer.zero_grad()
    loss.backward()   # works: the graph was built before entering the block
    optimizer.step()  # the weights are updated

    print(layer.weight.grad)
    print(layer.weight.data)

You will see that the gradients were calculated by backpropagation and the optimizer was able to change the layer’s weights. Sorry I made it unnecessarily long. Hope you could follow along.
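If you would rather check this than compare printed tensors by eye, here is a sketch of the same experiment that keeps a copy of the weights and compares them afterwards. The names weights_before, grad_populated and weights_changed are made up for this check.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
a = torch.randn(200, 100)
b = torch.randn(200, 20)
layer = nn.Linear(100, 20)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)

# Graph is built here, with gradient tracking enabled
loss = torch.sum(layer(a) - b)

# Copy of the weights before the update, detached from the graph
weights_before = layer.weight.detach().clone()

with torch.no_grad():
    optimizer.zero_grad()
    loss.backward()   # uses the graph recorded outside the block
    optimizer.step()

grad_populated = layer.weight.grad is not None
weights_changed = not torch.equal(weights_before, layer.weight)
print(grad_populated, weights_changed)
```

So wrapping only the backward and step calls in torch.no_grad() does not protect the model during validation; the forward pass is what has to run inside the block.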

In almost all cases you would call backward inside the training loop, where the loss is calculated over the training data. torch.no_grad() makes computation faster, so you use it in validation and in inference after training, where gradient calculation is no longer required.