Gradient computation in custom backward

albanD · November 30, 2020, 3:45pm

Hdk:

cont_loss_weight = torch.autograd.grad(outputs=cont_loss, inputs=weight, retain_graph=True)
is getting executed, but it neither shows an error nor returns anything, because the line that follows it, print("Shape:", cont_loss_weight.shape), doesn’t get printed.

This line would print at least “Shape:” regardless of the result of the previous line, so these lines just never get run.
Also, autograd.grad always returns a tuple, even if you pass a single Tensor as inputs.
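
As a minimal sketch of the tuple behaviour (the small weight and cont_loss below are hypothetical stand-ins for the ones in the thread):

```python
import torch

# Hypothetical stand-ins for the thread's `weight` and `cont_loss`.
weight = torch.randn(4, 3, requires_grad=True)
cont_loss = (weight ** 2).sum()

# autograd.grad returns a tuple with one entry per input,
# even when a single Tensor is passed as `inputs`.
result = torch.autograd.grad(outputs=cont_loss, inputs=weight, retain_graph=True)
cont_loss_weight = result[0]  # index the tuple to get the gradient Tensor

print(type(result).__name__, len(result))
print("Shape:", cont_loss_weight.shape)
```

Calling .shape on the tuple itself would raise an AttributeError rather than silently print nothing, which is another hint that the lines in question are never reached.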

As mentioned before, I don’t think the custom Function approach is the simplest here, especially if you’re not already familiar with the mathematical definition and the specific constructs.

Hdk · November 30, 2020, 6:55pm

Sorry, I don’t follow. Why doesn’t execution reach the line print("Shape:", cont_loss_weight.shape) and print the shape? I don’t know why autograd keeps running the same loop again and again (I suspect something similar to this thread), because the same loss keeps getting printed, as you can see in the output I was getting:

Hdk:

Loss:  tensor([37.218], device='cuda:0', grad_fn=<AddBackward0>) True
Loss:  tensor([37.218], device='cuda:0', grad_fn=<AddBackward0>) True
Loss:  tensor([37.218], device='cuda:0', grad_fn=<AddBackward0>) True
.
.
RuntimeError: CUDA out of memory.

The output I expected from print("Shape:", cont_loss_weight[0].shape) is shown below, because autograd.grad should return a tuple of size 1 whose only element is a tensor of shape ([128,96,5,5]):

Loss:  tensor([37.218], device='cuda:0', grad_fn=<AddBackward0>) True
Shape: ([128,96,5,5])
Loss:  tensor([different loss value for next batch], device='cuda:0', grad_fn=<AddBackward0>) True
Shape: ([128,96,5,5])
Loss:  tensor([different loss value for next to next batch], device='cuda:0', grad_fn=<AddBackward0>) True
.
.

albanD · November 30, 2020, 7:02pm

If your autograd.grad calls this same backward function again then yes you will end up doing infinite recursion.

Hdk · November 30, 2020, 8:09pm

But I don’t think that is the case here, because:
Static forward:
A batch comes in as input, and the output of the convolution is returned.

Static backward:
We have the saved tensors weight, bias and output. cont_loss is calculated from the output of forward, and the backpropagation of cont_loss w.r.t. weight and bias happens only within the static backward. For one batch, the static backward of this layer is called only once (hence the backpropagation inside backward also happens once per batch).
Right?

albanD · November 30, 2020, 8:53pm

Isn’t the cont_loss that you compute during the backward computed based on the forward’s output?
If so, when you try to get gradients w.r.t. the weights, it will backprop through this custom Function again, trying to get gradients for the weights given gradients from the output. And so you will recurse infinitely, right?

Hdk · November 30, 2020, 10:30pm

It is indeed
Yes, I got your point.

I had that doubt; that’s why I tried to do the forward as in this code.
I am getting the correct shapes for cont_loss_weight and cont_loss_bias, but I doubt whether that’s the correct way. What are your thoughts?

albanD · December 1, 2020, 2:54pm

This code is a bit confusing to me…
But again, it seems to just be calling into autograd. Why not just use autograd end to end? I really think you should drop the custom Function and compute what you need using autograd!

Hdk · December 1, 2020, 10:43pm

But the thing is, I have to include the gradient of cont_loss in the backward somehow, i.e. grad_weight += cont_loss_weight. I am not sure what you mean by this.

Can you write a code sample please?

albanD · December 2, 2020, 2:27pm

With latest nightly:

out, F = model(inp)
# here F is whatever information you need from the model to compute the contrastive loss

loss = criterion(out, label)
cont_loss = compute_cont_loss(out, F)

model.zero_grad()
loss.backward(retain_graph=True) # Use retain graph because you do two backward
cont_loss.backward(inputs=model.special_layer.parameters())
opt.step()

If you don’t have the latest nightly, you can use autograd.grad() like you do in your code and manually add these gradients to the params by doing something like:

cont_params = list(model.special_layer.parameters())
grads = autograd.grad(cont_loss, inputs=cont_params)
for p, g in zip(cont_params, grads):
  p.grad += g
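
A self-contained sketch of both variants on a toy model may help show that they end up with the same gradients. ToyModel, compute_cont_loss and the tensor shapes are hypothetical stand-ins, and the auxiliary loss is a placeholder rather than a real contrastive loss. The first variant assumes a PyTorch version whose backward() accepts the inputs argument.

```python
import copy
import torch
from torch import nn, autograd

torch.manual_seed(0)


class ToyModel(nn.Module):
    """Hypothetical model; special_layer plays the role of the layer of interest."""

    def __init__(self):
        super().__init__()
        self.special_layer = nn.Linear(8, 8)
        self.head = nn.Linear(8, 3)

    def forward(self, x):
        feat = torch.relu(self.special_layer(x))
        return self.head(feat), feat


def compute_cont_loss(feat):
    # Placeholder auxiliary loss, NOT a real contrastive loss.
    return feat.pow(2).mean()


def grads_with_backward_inputs(model, inp, label, criterion):
    out, feat = model(inp)
    loss = criterion(out, label)
    cont_loss = compute_cont_loss(feat)
    model.zero_grad()
    loss.backward(retain_graph=True)  # keep the graph for the second backward
    # Accumulate cont_loss gradients only into special_layer's parameters.
    cont_loss.backward(inputs=list(model.special_layer.parameters()))
    return [p.grad.clone() for p in model.parameters()]


def grads_with_autograd_grad(model, inp, label, criterion):
    out, feat = model(inp)
    loss = criterion(out, label)
    cont_loss = compute_cont_loss(feat)
    model.zero_grad()
    loss.backward(retain_graph=True)
    cont_params = list(model.special_layer.parameters())
    # autograd.grad returns the gradients and leaves .grad untouched,
    # so they are added by hand.
    grads = autograd.grad(cont_loss, inputs=cont_params)
    for p, g in zip(cont_params, grads):
        p.grad += g
    return [p.grad.clone() for p in model.parameters()]


inp = torch.randn(16, 8)
label = torch.randint(0, 3, (16,))
criterion = nn.CrossEntropyLoss()

model_a = ToyModel()
model_b = copy.deepcopy(model_a)  # identical weights for a fair comparison

grads_a = grads_with_backward_inputs(model_a, inp, label, criterion)
grads_b = grads_with_autograd_grad(model_b, inp, label, criterion)

max_diff = max((a - b).abs().max().item() for a, b in zip(grads_a, grads_b))
print("largest gradient difference between the two variants:", max_diff)
```

In both variants the parameters of head only receive the gradient of loss, while those of special_layer receive the sum of both gradients.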

Hdk · December 3, 2020, 10:32pm

Hi @albanD,

Thanks for replying. Based on your recommendation, I tried the following but GPU goes out of memory.

import torch
from torch import autograd

## Training and Testing loop
ct = torch.nn.CrossEntropyLoss()
ct2 = Cont_loss_torch_module(t=0.08).to(device)

step = 0

for epoch in range(num_epochs):

    for images, labels in train_loader:
        images = images.to(device)
        labels = labels.to(device)
        if step % 200 == 0:
            Classifier.eval()
            with torch.no_grad():
                ...  # Testing
            Classifier.train()
        conv4_out, output = Classifier(images)
        loss = ct(output, labels)
        Cont_loss = ct2(conv4_out, labels)

        # total_loss = loss + Cont_loss
        # total_loss.backward()

        # First pass: cross-entropy gradients for all parameters.
        loss.backward(retain_graph=True)
        # Second pass: Cont_loss gradients for conv4 only, added by hand.
        cont_params = list(conv4.parameters())
        grads = autograd.grad(Cont_loss, inputs=cont_params)
        for p, g in zip(cont_params, grads):
            p.grad += g
        optimizer.step()
        optimizer.zero_grad()

        ## Tracking Accuracy
        ...
        step += 1

Does this mean that it will run the backward as usual but will only update the .grad fields of the parameters of that particular layer (i.e. conv4)?

albanD · December 4, 2020, 2:34pm

If you use autograd.grad, it does not update any .grad field; that is why you have to do it by hand afterwards.
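
A small sketch of that difference, using a hypothetical three-element parameter w:

```python
import torch

w = torch.randn(3, requires_grad=True)  # hypothetical parameter
loss = (w * 2).sum()

(g,) = torch.autograd.grad(loss, inputs=[w])
grad_before = w.grad  # still None: autograd.grad did not touch .grad
print(grad_before)
print(g)  # tensor([2., 2., 2.])

# Add the gradient by hand, also covering the case where .grad is still None.
if w.grad is None:
    w.grad = g.clone()
else:
    w.grad += g
print(w.grad)
```

The None check matters when no earlier backward() has populated .grad, since p.grad += g fails on a None gradient.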

Hdk:

I tried the following but GPU goes out of memory

That would be unrelated. You need to make sure your model and batch size are small enough to fit in memory.

Hdk · December 5, 2020, 10:39pm

Hi, based on what I understood, I am doing this to make sure that the gradients of conv1-3 and conv5 get populated and updated according to the cross-entropy loss, and those of conv4 according to both CE and cont_loss.

But I got:

RuntimeError: One of the differentiated Tensors appears to not have been used in the graph. Set allow_unused=True if this is the desired behavior.

Here’s the code.

optimizer = torch.optim.Adam(**parameters of conv 1-3, 5**, lr) 
optimizer_ = torch.optim.Adam(**parameters of conv 4**, lr)

# Training
        .
        .
        loss = ct(output, labels)
        Cont_loss = ct2(conv4_out, labels)
        loss.backward(retain_graph=True)
        cont_params = list(conv4.parameters())
        grads = autograd.grad(Cont_loss, inputs= cont_params)
        for p, g in zip(cont_params, grads):
            p.grad += g
        optimizer_.step()
        optimizer.step() 
        optimizer_.zero_grad()                               
        optimizer.zero_grad()

albanD · December 7, 2020, 2:59pm

The part of the code you shared looks good.

For the error, you want to make sure that all the params were actually used to compute the loss.

Hdk · December 7, 2020, 3:30pm

As you can see, for the cont_loss calculation I am giving the output of the 4th layer as input, and calculating this output obviously uses the parameters of the 4th layer.

So, ultimately the parameters are used in computing the loss.

But for this use case, bigger batch sizes yield good performance as per earlier research work. So, what should I do to make sure that GPU doesn’t go out of memory when bigger batch sizes are used?

albanD · December 8, 2020, 5:11pm

From the name of the variables, it does look like you do. But the error is still there. Are you sure you don’t have extra Parameters on that Module that are not actually used during the forward?
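
One way to check this is sketched below with a hypothetical Block module that registers a parameter (extra) its forward never uses. autograd.grad raises the error quoted above for such a parameter, allow_unused=True returns None for it instead, and a plain .backward() does not complain at all: it simply leaves that parameter’s .grad as None.

```python
import torch
from torch import nn, autograd


class Block(nn.Module):
    """Hypothetical module with one parameter that forward never uses."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 4, kernel_size=3)
        self.extra = nn.Parameter(torch.zeros(4))  # registered but unused

    def forward(self, x):
        return self.conv(x)


block = Block()
loss = block(torch.randn(2, 3, 8, 8)).pow(2).mean()
params = list(block.parameters())

# Without allow_unused, the unused parameter triggers a RuntimeError.
try:
    autograd.grad(loss, inputs=params, retain_graph=True)
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

# With allow_unused=True, unused parameters come back as None,
# which makes them easy to list by name.
grads = autograd.grad(loss, inputs=params, allow_unused=True, retain_graph=True)
unused = [name for (name, _), g in zip(block.named_parameters(), grads) if g is None]
print(unused)  # ['extra']

# backward() raises nothing here; the unused parameter just keeps .grad = None.
loss.backward()
print(block.extra.grad)  # None
```

Running the allow_unused=True variant on the real conv4 parameters would show which one, if any, is not part of the graph of Cont_loss.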

Hdk:

So, what should I do to make sure that GPU doesn’t go out of memory when bigger batch sizes are used?

You can check on this forum for solutions.
You can do things like splitting the batch and doing several forward/backward before doing the optimizer step.
Or you can use tools like torch.utils.checkpoint to try and reduce the memory usage at the cost of more computations.
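
The batch-splitting idea can be sketched as follows. The model, data and num_chunks are hypothetical, and the sketch assumes a mean-reduced per-sample loss such as cross-entropy with equally sized chunks, so that the accumulated gradient matches the full-batch one.

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)

# Hypothetical small model and data, only to illustrate the pattern.
model = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 4))
reference = copy.deepcopy(model)  # same weights, used for a full-batch comparison
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

images = torch.randn(32, 10)
labels = torch.randint(0, 4, (32,))
num_chunks = 4  # 32 samples -> 4 chunks of 8

optimizer.zero_grad()
for x, y in zip(images.chunk(num_chunks), labels.chunk(num_chunks)):
    # Divide by num_chunks so the summed gradients equal the full-batch mean.
    loss = criterion(model(x), y) / num_chunks
    # Each backward frees that chunk's graph, so only one chunk's
    # activations are alive at a time; gradients accumulate in .grad.
    loss.backward()
accumulated = [p.grad.clone() for p in model.parameters()]
optimizer.step()  # a single optimizer step for the whole batch

# Full-batch gradients on the untouched copy, for comparison.
criterion(reference(images), labels).backward()
full = [p.grad for p in reference.parameters()]

max_diff = max((a - f).abs().max().item() for a, f in zip(accumulated, full))
print("largest difference to the full-batch gradient:", max_diff)
```

For a loss that compares samples within a batch, as a contrastive loss typically does, accumulating over chunks is not equivalent to one large batch, because pairs that fall into different chunks are never compared. In that case torch.utils.checkpoint may be the better fit, since it keeps the full batch together.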

Hdk · December 8, 2020, 5:46pm

Yes, I am sure, because with the same ct2 loss function module, if I do (loss + Cont_loss).backward() instead, it works.