Autograd for different loss functions in different layers

Hey all,

I am having a problem with autograd when using different losses in different layers. I want to implement a new algorithm, but first I ran a sanity check to see whether it gives the same results as ordinary backpropagation.

import torch 
import torch.nn as nn

device = 'cpu'

#define neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Linear(1, 1, bias=False))
        self.layer2 = nn.Sequential(
            nn.Linear(1, 1, bias=False))
    def forward(self, x):
        out1 = self.layer1(x)
        out2 = self.layer2(out1)
        return out1,out2
model = Net().to(device)

#desired output
t = (torch.randn(1, 1)).to(device)

inp = (torch.randn(1, 1)).to(device)

#run network
outputs = model(inp)

#loss from the output layer
loss = 0.5*torch.sum( (t - outputs[1])**2.0)

#calculate the gradients just to check

#gradient of the second weight
grad1 = torch.autograd.grad(loss, model.layer2[0].weight, retain_graph=True)

#part of the chain rule to calculate the gradient of the first weight
t1 = torch.autograd.grad(-loss, outputs[0], create_graph=True, retain_graph=True)[0]

#so it will be same as backpropagation
t1 += outputs[0]

#new loss for the gradient
loss2 = 0.5*torch.sum( (t1 - outputs[0])**2.0)

#gradient for the first weight
grad2 = torch.autograd.grad(loss2, model.layer1[0].weight)


loss2 is the loss function for the hidden layer. Mathematically this should be equivalent to the usual backpropagation, so grad2 should equal the ordinary gradient of the first weight, but it doesn’t. I checked t1 and it is correct, so something is going wrong with grad2. Can someone help me with this?


Could you explain why grad2 should be the same as the plain gradient? It uses a gradient penalty computed in a differentiable manner, meaning that you introduce second-order gradients here. Maybe you did not want create_graph=True?
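To make the second-order effect concrete, here is a minimal sketch for the scalar two-layer network from the question. It uses bare nn.Linear layers and float64 instead of the Net class (both are simplifications for brevity, not part of the original code). Because create_graph=True keeps t1 attached to the graph, loss2 also gets differentiated through t1, and for this particular network grad2 should come out as the backprop gradient scaled by the second weight squared.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Simplified stand-in for the Net class; float64 keeps the comparison tight.
layer1 = nn.Linear(1, 1, bias=False).double()
layer2 = nn.Linear(1, 1, bias=False).double()
inp = torch.randn(1, 1, dtype=torch.float64)
t = torch.randn(1, 1, dtype=torch.float64)

out1 = layer1(inp)
out2 = layer2(out1)
loss = 0.5 * torch.sum((t - out2) ** 2.0)

# Reference: ordinary backpropagation gradient for the first weight.
grad_backprop = torch.autograd.grad(loss, layer1.weight, retain_graph=True)[0]

# Construction from the question: t1 stays attached to the graph.
g = torch.autograd.grad(-loss, out1, create_graph=True)[0]
t1 = g + out1
loss2 = 0.5 * torch.sum((t1 - out1) ** 2.0)
grad2 = torch.autograd.grad(loss2, layer1.weight)[0]

# For this scalar network the two differ by a factor of weight2 ** 2.
w2 = layer2.weight.detach()
print(grad2, w2 ** 2 * grad_backprop)
```

The extra factor comes from differentiating g itself with respect to out1, which is exactly the second-order term mentioned in the reply.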


With the usual backpropagation, the gradients of the two weights would be:
grad1 = -(t - outputs[1]) * outputs[1]
grad2 = -(t-outputs[1])* model.layer2[0].weight * outputs[0]

For the new loss function in the hidden layer, I have:
L = 0.5*torch.sum( (t1 - outputs[0])**2.0)
and t1 is:
t1 = outputs[0] + (t-outputs[1]) * model.layer2[0].weight
This is what the algorithm requires.

Now, I want the gradient for the first weight, which is the derivative of L with respect to that weight:
grad = -(t1 - outputs[0]) * outputs[0]
grad = -(t-outputs[1]) * model.layer2[0].weight * outputs[0]

This is the same as the usual way. If I remove create_graph=True, I get a zero gradient. Maybe I am misunderstanding what create_graph does?
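The zero gradient without create_graph can be reproduced in a few lines. This is a sketch using bare nn.Linear layers in place of the Net class (an assumption made to keep it short). Without create_graph, the result of torch.autograd.grad is a constant for autograd, but t1 still depends on the hidden output through the addition, so the hidden output cancels out of (t1 - out1).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Simplified stand-in for the Net class from the question.
layer1 = nn.Linear(1, 1, bias=False)
layer2 = nn.Linear(1, 1, bias=False)
inp, t = torch.randn(1, 1), torch.randn(1, 1)

out1 = layer1(inp)
out2 = layer2(out1)
loss = 0.5 * torch.sum((t - out2) ** 2.0)

# Without create_graph, g carries no graph: autograd treats it as a constant.
g = torch.autograd.grad(-loss, out1, retain_graph=True)[0]
t1 = g + out1  # t1 still depends on out1 through this addition

# (t1 - out1) equals the constant g, so out1 cancels and no gradient
# reaches the first weight.
loss2 = 0.5 * torch.sum((t1 - out1) ** 2.0)
grad_w1 = torch.autograd.grad(loss2, layer1.weight)[0]
print(grad_w1)
```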

By grad1, do you mean model.layer2[0].weight.grad with respect to the first loss you computed?
If so, the loss is actually loss = 0.5*torch.sum( (t - outputs[0] * model.layer2[0].weight)**2.0)
So grad1 would be -(t - outputs[1]) * outputs[0] and not -(t - outputs[1]) * outputs[1].

And if grad2 is model.layer1[0].weight.grad with respect to the first loss computed, that loss is loss = 0.5*torch.sum( (t - x * model.layer1[0].weight * model.layer2[0].weight)**2.0), so you get grad2 = -(t-outputs[1]) * model.layer2[0].weight * x, right?
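These two closed-form expressions can be checked against autograd directly. The sketch below assumes the same bias-free scalar network, written with bare nn.Linear layers and float64 for brevity; the names hand_w1 and hand_w2 are only illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Simplified stand-in for the Net class; float64 keeps the comparison tight.
layer1 = nn.Linear(1, 1, bias=False).double()
layer2 = nn.Linear(1, 1, bias=False).double()
x = torch.randn(1, 1, dtype=torch.float64)
t = torch.randn(1, 1, dtype=torch.float64)

out1 = layer1(x)
out2 = layer2(out1)
loss = 0.5 * torch.sum((t - out2) ** 2.0)

# Gradients from autograd for the second and first weight.
auto_w2, auto_w1 = torch.autograd.grad(loss, [layer2.weight, layer1.weight])

# Hand-derived gradients from the chain rule.
with torch.no_grad():
    hand_w2 = -(t - out2) * out1               # d loss / d weight of layer2
    hand_w1 = -(t - out2) * layer2.weight * x  # d loss / d weight of layer1

print(torch.allclose(auto_w2, hand_w2), torch.allclose(auto_w1, hand_w1))
```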

Yes, grad1 is correct, I made a typo.

For grad2, that indeed means model.layer1[0].weight.grad. I thought I could write the output values directly in the loss function, without the weights, since I would like to use t1 instead of t. But this is not possible, is it?

I don’t think it is.
Also why do you compute t1 with -loss?

Ah, too bad. I will have to find a workaround then, because I really need the derivative of the loss with t1 present there.
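One possible workaround, assuming the intent is to treat t1 as a fixed target for the hidden layer: detach t1 from the graph before building loss2. This is only a sketch under that assumption, not necessarily the algorithm in question, and it again uses bare nn.Linear layers and float64 for brevity. With t1 held constant, the gradient of loss2 for the first weight should agree with the ordinary backpropagation gradient.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
# Simplified stand-in for the Net class; float64 keeps the comparison tight.
layer1 = nn.Linear(1, 1, bias=False).double()
layer2 = nn.Linear(1, 1, bias=False).double()
inp = torch.randn(1, 1, dtype=torch.float64)
t = torch.randn(1, 1, dtype=torch.float64)

out1 = layer1(inp)
out2 = layer2(out1)
loss = 0.5 * torch.sum((t - out2) ** 2.0)

# Reference: ordinary backpropagation gradient for the first weight.
grad_backprop = torch.autograd.grad(loss, layer1.weight, retain_graph=True)[0]

# Hidden-layer target, cut off from the graph so it acts as a constant.
g = torch.autograd.grad(-loss, out1, retain_graph=True)[0]
t1 = (g + out1).detach()

# Local loss for the hidden layer; only out1 carries gradient now.
loss2 = 0.5 * torch.sum((t1 - out1) ** 2.0)
grad2 = torch.autograd.grad(loss2, layer1.weight)[0]

print(grad2, grad_backprop)
```

Since t1 is detached, create_graph=True is no longer needed for the first torch.autograd.grad call.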

The minus sign is because I wrote the loss as (t - output) instead of (output - t), the update of the weights will be w += lr * grad instead of w += -lr * grad later on.

But thank you anyway for the help.

The minus sign is because I wrote the loss as (t - output) instead of (output - t)

But this is squared. So both are exactly the same. You don’t need a - sign here right?

Sorry, I meant the derivative of the loss.
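A small sketch of the sign question, with a standalone tensor named out standing in for a network output (an assumption for illustration): writing the squared error as (t - out) or (out - t) gives the same loss and the same gradient, and only negating the loss flips the sign of the derivative.

```python
import torch

torch.manual_seed(0)
out = torch.randn(1, 1, requires_grad=True)  # stand-in for a network output
t = torch.randn(1, 1)

loss_a = 0.5 * torch.sum((t - out) ** 2.0)
loss_b = 0.5 * torch.sum((out - t) ** 2.0)

# Both forms of the squared error give the gradient (out - t).
grad_a = torch.autograd.grad(loss_a, out, retain_graph=True)[0]
grad_b = torch.autograd.grad(loss_b, out)[0]

# Negating the loss gives (t - out), the form used to build t1.
grad_neg = torch.autograd.grad(-loss_a, out)[0]

print(grad_a, grad_b, grad_neg)
```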