Regarding loss.backward() inside "with torch.no_grad"

Yes.

torch.no_grad() makes computation faster because autograd no longer keeps track of gradients. However, loss.backward() called inside torch.no_grad() still computes gradients that can be backpropagated, as long as the loss itself was built outside the block (I checked this with the short snippets below). So be careful when calling loss.backward() there.

For example:

import torch
import torch.nn as nn

a = torch.randn(200, 100)
b = torch.randn(200, 20)
layer = nn.Linear(100, 20)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)
print(layer.weight.data)

out = layer(a) - b

with torch.no_grad():
    optimizer.zero_grad()
    loss = torch.sum(out)  # created inside no_grad, so no grad_fn is recorded
    loss.backward()        # raises a RuntimeError (see below)
    optimizer.step()

    print(layer.weight.grad)
    print(layer.weight.data)

Running the first snippet throws an error saying that the loss tensor does not require grad and has no grad_fn (gradient function). That happens because inside torch.no_grad() autograd records nothing, so any tensor created in the block, including the loss, has requires_grad = False and no grad_fn.
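To make that concrete, here is a tiny check (the tensor x below is only for illustration and is not part of the example above): the same operation gets a grad_fn when run outside torch.no_grad(), and none when run inside it.

import torch

x = torch.randn(3, requires_grad=True)

y = x * 2                              # recorded by autograd
print(y.requires_grad, y.grad_fn)      # True, <MulBackward0 object ...>

with torch.no_grad():
    z = x * 2                          # not recorded
    print(z.requires_grad, z.grad_fn)  # False, None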

Now, if you write the same code but compute the loss outside torch.no_grad(), autograd will calculate the gradient of the loss with respect to each trainable parameter.

import torch
import torch.nn as nn

a = torch.randn(200, 100)
b = torch.randn(200, 20)
layer = nn.Linear(100, 20)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)
print(layer.weight.data)

out = layer(a) - b
loss = torch.sum(out)  # the graph is built here, outside no_grad

with torch.no_grad():
    optimizer.zero_grad()
    loss.backward()    # still works: it walks the graph recorded above
    optimizer.step()

    print(layer.weight.grad)  # gradients were computed
    print(layer.weight.data)  # weights were updated by the optimizer

You will see that gradients were calculated and backpropagated, and the optimizer was able to update the layer's weights. Sorry for making this unnecessarily long; I hope you could follow along.

In almost all cases you call backward() inside the training loop, where the loss is computed on training data. torch.no_grad() makes computation faster, so you use it for validation and inference after training, where gradient calculation is no longer required. A rough sketch of that pattern is below.
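Here is a minimal sketch of that usual split, assuming a toy linear model and random data just to keep it self-contained (none of these names come from the original post):

import torch
import torch.nn as nn

# Hypothetical setup so the sketch runs on its own
model = nn.Linear(100, 20)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
train_data = [(torch.randn(32, 100), torch.randn(32, 20)) for _ in range(5)]
val_data = [(torch.randn(32, 100), torch.randn(32, 20)) for _ in range(2)]

# Training: gradients are needed, so no torch.no_grad() here
model.train()
for x, y in train_data:
    optimizer.zero_grad()
    loss = criterion(model(x), y)  # graph is recorded here
    loss.backward()                # gradients computed from that graph
    optimizer.step()

# Validation / inference: no gradients needed, wrap it in torch.no_grad()
model.eval()
with torch.no_grad():
    val_loss = sum(criterion(model(x), y).item() for x, y in val_data) / len(val_data)
    print(val_loss)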
