Regarding loss.backward() inside "with torch.no_grad"

cbd · March 10, 2021, 11:44am

In training mode, I call a function that calculates the loss and calls “loss.backward()”.
In validation, if I call the same function inside “with torch.no_grad()”, does it still run “loss.backward()” or not?

Suraj · March 10, 2021, 12:54pm

Hi,

No. You don’t have to call loss.backward() during validation, since the purpose of validation is to assess the model on data it has not seen during training (although validation data does not stay unseen when you validate over different folds; you can ignore this for now to avoid confusion, and the subtleties can be discussed later).

loss.backward() is called to compute the gradient of the loss w.r.t. each trainable parameter. During validation you don’t want to update the model based on the validation loss, because that would defeat the purpose of validation, which is to test the model on one portion (say 10%) of the training data after training it on the other portion (i.e. 90%).
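As a minimal sketch of what the backward call does (the layer sizes and the squared-error loss are arbitrary choices for illustration, not taken from the question): before backward no gradients are stored, and afterwards every trainable parameter has a .grad tensor of the same shape as the parameter.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

layer = nn.Linear(4, 2)      # illustrative sizes
x = torch.randn(8, 4)
target = torch.randn(8, 2)

# Before any backward call, no gradients are stored on the parameters.
grads_before = [p.grad for p in layer.parameters()]

loss = ((layer(x) - target) ** 2).mean()
loss.backward()              # fills p.grad for every trainable parameter

for name, p in layer.named_parameters():
    print(name, tuple(p.shape), tuple(p.grad.shape))
```

Note that backward only stores the gradients; the weights change only when an optimizer step is applied afterwards.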

Side note: “with torch.no_grad()” just lets you do the assessment without keeping track of gradients, because during the assessment (i.e. validation) you don’t want to deal with gradients or any kind of update to the trainable parameters anyway.

cbd · March 11, 2021, 6:32am

You mean to say we should use “torch.no_grad()” so that it does not calculate gradients, and at the same time we should not call “loss.backward()” in validation. If we do call “loss.backward()” in validation, then I think it backpropagates the stored gradients. Am I right?

Suraj · March 11, 2021, 7:45am

Yes.

torch.no_grad() makes computation faster because autograd no longer records operations for gradient computation. I think (I also checked with a short piece of code) that loss.backward() inside torch.no_grad() still calculates gradients when the loss itself was computed outside the block, and the optimizer can then apply them. So be careful with loss.backward().

For example:

import torch
import torch.nn as nn

a = torch.randn(200, 100)
b = torch.randn(200, 20)
layer = nn.Linear(100, 20)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)
print(layer.weight.data)

# Forward pass outside no_grad: "out" is tracked by autograd
out = layer(a) - b

with torch.no_grad():
    optimizer.zero_grad()
    # Computed inside no_grad: "loss" has no grad_fn
    loss = torch.sum(out)
    loss.backward()  # raises RuntimeError, the lines below are not reached
    optimizer.step()

    print(layer.weight.grad)
    print(layer.weight.data)

This throws an error saying that there is no grad_fn (meaning gradient function) associated with the loss. The reason is that the loss was computed inside torch.no_grad(): the result of any operation run inside the block has requires_grad = False, even when its inputs require gradients.
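A small check of that behaviour, as a sketch with the same layer sizes as above: the parameters themselves keep requires_grad = True inside the block, and it is only the results of operations run there that come out detached from the graph.

```python
import torch
import torch.nn as nn

layer = nn.Linear(100, 20)
a = torch.randn(200, 100)

out_tracked = layer(a)                 # computed with autograd recording

with torch.no_grad():
    out_untracked = layer(a)           # computed without recording
    loss = torch.sum(out_tracked)      # untracked too, although its input is tracked
    print(layer.weight.requires_grad)  # parameters are not modified by no_grad

print(out_tracked.requires_grad, out_tracked.grad_fn is not None)
print(out_untracked.requires_grad, out_untracked.grad_fn)
print(loss.requires_grad, loss.grad_fn)
```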

Now, if you write the same code but with the loss calculated outside torch.no_grad(), autograd will calculate the gradient of the loss w.r.t. each trainable parameter.

import torch
import torch.nn as nn

a = torch.randn(200, 100)
b = torch.randn(200, 20)
layer = nn.Linear(100, 20)
optimizer = torch.optim.Adam(layer.parameters(), lr=0.001)
print(layer.weight.data)

# Forward pass and loss both outside no_grad: the graph is recorded
out = layer(a) - b
loss = torch.sum(out)

with torch.no_grad():
    optimizer.zero_grad()
    loss.backward()   # works, the graph already exists
    optimizer.step()  # weights are updated

    print(layer.weight.grad)
    print(layer.weight.data)

You will see that gradients were calculated and backpropagated and the optimizer was able to make changes to the layer’s weights. Sorry I made it unnecessarily long. Hope you could follow through.

In almost all cases you would call backward inside the training loop, where the loss is calculated over the training data. torch.no_grad() makes computation faster, so you use it in validation and inference after training, where gradient calculation is no longer required.
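Put together, a typical loop could look like the sketch below. The synthetic data, the single linear layer, the MSE loss and the three epochs are all placeholders chosen for illustration; only the placement of backward() and torch.no_grad() is the point.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic data; sizes are arbitrary.
x_train, y_train = torch.randn(200, 100), torch.randn(200, 20)
x_val, y_val = torch.randn(50, 100), torch.randn(50, 20)

model = nn.Linear(100, 20)
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
loss_fn = nn.MSELoss()

for epoch in range(3):
    # Training: forward and backward both run with autograd enabled.
    model.train()
    optimizer.zero_grad()
    train_loss = loss_fn(model(x_train), y_train)
    train_loss.backward()
    optimizer.step()

    # Validation: forward only. No graph is recorded, no backward, no step.
    model.eval()
    weights_before = model.weight.detach().clone()
    with torch.no_grad():
        val_loss = loss_fn(model(x_val), y_val)

    print(epoch, train_loss.item(), val_loss.item())
```

The copy in weights_before is only there to make it easy to confirm that the validation pass leaves the weights untouched.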

SaturnTsen · July 25, 2024, 2:58pm

So why is it recommended (or necessary) to call loss.backward() inside with torch.no_grad()? Is it possible to remove `with torch.no_grad()`, and what happens if you do?

The tutorial Dive into Deep Learning implements the following code:

    def fit_epoch(self):
        """Defined in :numref:`sec_linear_scratch`"""
        self.model.train()
        for batch in self.train_dataloader:
            loss = self.model.training_step(self.prepare_batch(batch))
            self.optim.zero_grad()
            with torch.no_grad():
                loss.backward()
                if self.gradient_clip_val > 0:  # To be discussed later
                    self.clip_gradients(self.gradient_clip_val, self.model)
                self.optim.step()
            self.train_batch_idx += 1
        if self.val_dataloader is None:
            return
        self.model.eval()
        for batch in self.val_dataloader:
            with torch.no_grad():
                self.model.validation_step(self.prepare_batch(batch))

ptrblck · July 25, 2024, 7:08pm

It’s neither recommended nor necessary to call backward inside a no_grad context, and it will not work at all if the forward pass also runs inside that context:

import torch
from torchvision import models

model = models.resnet18()
x = torch.randn(1, 3, 224, 224)

# works
out = model(x)
out.mean().backward()

# fails
with torch.no_grad():
    out = model(x)
    out.mean().backward()
# RuntimeError: element 0 of tensors does not require grad and does not have a grad_fn
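The failing case above runs the forward pass inside no_grad. In the tutorial's fit_epoch the loss is computed before entering the block, which is a different situation: the graph already exists when backward is called. A minimal sketch (toy linear model, all names hypothetical) suggesting that wrapping only the backward call in no_grad gives the same gradients as calling it normally:

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

model_a = nn.Linear(10, 1)
model_b = copy.deepcopy(model_a)   # identical initial weights
x = torch.randn(16, 10)

# Variant 1: backward called normally.
model_a(x).mean().backward()

# Variant 2: forward outside no_grad, backward inside it.
loss_b = model_b(x).mean()
with torch.no_grad():
    loss_b.backward()

same = torch.allclose(model_a.weight.grad, model_b.weight.grad)
print(same)
```

So for that pattern the no_grad wrapper around backward could be removed without changing the gradients. A plausible reason for it in the tutorial (an assumption, not confirmed here) is that the gradient clipping and a from-scratch optimizer step update tensors in place, and such manual updates do need to run without autograd tracking.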