Autograd for different loss functions in different layers

Tiago · January 22, 2020, 2:27pm

Hey all,

I am having a problem with autograd when using different losses in different layers. I want to implement a new algorithm, but first I ran a sanity check to see whether it gives the same results as ordinary backpropagation.

import torch 
import torch.nn as nn

device = 'cpu'

#define neural network
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.layer1 = nn.Sequential(
            nn.Linear(1, 1, bias=False))
        self.layer2 = nn.Sequential(
            nn.Linear(1, 1, bias=False))
        
    def forward(self, x):
        out1 = self.layer1(x)
        out2 = self.layer2(out1)
        return out1, out2
    
model = Net().to(device)

#desired output
t = (torch.randn(1, 1)).to(device)

#input
inp = (torch.randn(1, 1)).to(device)

#run network
outputs = model(inp)

#loss from the output layer
loss = 0.5*torch.sum( (t - outputs[1])**2.0)

#calculate the gradients just to check
loss.backward(retain_graph=True)

#gradient of the second weight
grad1 = torch.autograd.grad(loss, model.layer2[0].weight, retain_graph=True)

#part of the chain rule to calculate the gradient of the first weight
t1 = torch.autograd.grad(-loss, outputs[0], create_graph=True, retain_graph=True)[0]

#so it will be same as backpropagation
t1 += outputs[0]

#new loss for the gradient
loss2 = 0.5*torch.sum( (t1 - outputs[0])**2.0)

#gradient for the first weight
grad2 = torch.autograd.grad(loss2, model.layer1[0].weight)

print(grad2[0])
print(model.layer1[0].weight.grad)

loss2 is the loss function for the hidden layer. Mathematically this should be equivalent to the usual backpropagation, so the two printed gradients should be equal, but they aren’t. I checked t1 and it is correct, but something is wrong with grad2. Can someone help me with this?

albanD · January 22, 2020, 2:31pm

Hi,

Could you explain why grad2 should be the same as the plain gradient? It uses a gradient penalty computed in a differentiable manner, meaning that you introduce second-order gradients here. Maybe you did not want create_graph=True?
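
To illustrate the point about differentiating through t1, here is a minimal scalar sketch that does not use autograd at all. The values of x, t, w1 and w2 are arbitrary assumptions, central_diff is a hypothetical helper, and the 1x1 bias-free layers are treated as plain multiplications. Under those assumptions, if t1 is allowed to depend on the first weight, the gradient of loss2 comes out as the backpropagation gradient scaled by w2 squared rather than equal to it.

```python
# Scalar stand-ins for the 1x1 network in the post (values are arbitrary assumptions).
x, t = 0.7, -0.3       # input and desired output
w1, w2 = 0.5, 1.5      # weights of layer1 and layer2


def loss(w1_value):
    """Output-layer loss as a function of the first weight."""
    return 0.5 * (t - w2 * w1_value * x) ** 2


def loss2_through_t1(w1_value):
    """Hidden-layer loss where the target t1 still depends on the first weight."""
    out1 = w1_value * x
    out2 = w2 * out1
    t1 = out1 + (t - out2) * w2
    return 0.5 * (t1 - out1) ** 2


def central_diff(f, w, h=1e-6):
    """Hypothetical helper: central finite-difference derivative of f at w."""
    return (f(w + h) - f(w - h)) / (2 * h)


backprop_grad = central_diff(loss, w1)
grad2 = central_diff(loss2_through_t1, w1)

print("backprop gradient:", backprop_grad)
print("gradient of loss2:", grad2)
print("ratio:", grad2 / backprop_grad, "w2 squared:", w2 ** 2)
```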

Tiago · January 22, 2020, 2:42pm

Sure,

In the usual way, the gradients of the weights would be:
grad1 = -(t - outputs[1]) * outputs[1]
grad2 = -(t - outputs[1]) * model.layer2[0].weight * outputs[0]

For the new loss function in the hidden layer, I have:
L = 0.5*torch.sum( (t1 - outputs[0])**2.0)
and t1 is:
t1 = outputs[0] + (t - outputs[1]) * model.layer2[0].weight
The code computes this value correctly.

Now, the gradient I want for the first weight is the derivative of L with respect to that weight, which is:
grad = -(t1 - outputs[0]) * outputs[0]
grad = -(t-outputs[1]) * model.layer2[0].weight * outputs[0]

This is the same as the usual way. If I remove create_graph=True, I get a zero gradient. Maybe I am misunderstanding create_graph?
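
A likely explanation for the zero gradient, sketched here with plain scalars: without create_graph=True the result of autograd.grad is a constant, and the in-place `t1 += outputs[0]` then makes t1 follow outputs[0] exactly, so the difference t1 - outputs[0] no longer changes with the first weight. The values of x, t, w1 and w2 and the name delta are assumptions for illustration only.

```python
# Scalar stand-ins for the 1x1 network (values are arbitrary assumptions).
x, t = 0.7, -0.3
w1, w2 = 0.5, 1.5

# Gradient of -loss with respect to out1, frozen at the current weights.
# This stands in for the result of autograd.grad without create_graph=True.
delta = (t - w2 * w1 * x) * w2


def loss2_inplace_target(w1_value):
    """loss2 when t1 is built as delta + out1, mimicking `t1 += outputs[0]`."""
    out1 = w1_value * x
    t1 = delta + out1            # t1 moves together with out1
    return 0.5 * (t1 - out1) ** 2


h = 1e-6
grad = (loss2_inplace_target(w1 + h) - loss2_inplace_target(w1 - h)) / (2 * h)

# Zero up to floating-point rounding: the loss is 0.5 * delta**2 for every w1.
print("gradient of loss2 w.r.t. w1:", grad)
```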

albanD · January 22, 2020, 2:50pm

By grad1 you mean model.layer2[0].weight.grad wrt the first loss you computed?
If so, the loss is actually loss = 0.5*torch.sum( (t - outputs[0] * model.layer2[0].weight)**2.0)
So grad1 would be -(t - outputs[1]) * outputs[0] and not -(t - outputs[1]) * outputs[1].

And if grad2 is model.layer1[0].weight.grad wrt the first loss computed, this loss is loss = 0.5*torch.sum( (t - x * model.layer1[0].weight * model.layer2[0].weight)**2.0), and so you get grad2 = -(t - outputs[1]) * model.layer2[0].weight * x, right?
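
The two closed-form gradients above can be checked numerically with a small scalar sketch. The values of x, t, w1 and w2 are arbitrary assumptions. The names grad_w2_* and grad_w1_* are hypothetical and correspond to grad1 (second weight) and grad2 (first weight) in this thread.

```python
# Scalar stand-ins for the 1x1 network (values are arbitrary assumptions).
x, t = 0.7, -0.3
w1, w2 = 0.5, 1.5


def loss(w1_value, w2_value):
    """Output-layer loss written explicitly in terms of both weights."""
    return 0.5 * (t - x * w1_value * w2_value) ** 2


out1 = w1 * x          # plays the role of outputs[0]
out2 = w2 * out1       # plays the role of outputs[1]

# Closed-form gradients from the post above.
grad_w2_formula = -(t - out2) * out1        # grad1 in the thread
grad_w1_formula = -(t - out2) * w2 * x      # grad2 in the thread

# Central finite differences as an independent check.
h = 1e-6
grad_w2_numeric = (loss(w1, w2 + h) - loss(w1, w2 - h)) / (2 * h)
grad_w1_numeric = (loss(w1 + h, w2) - loss(w1 - h, w2)) / (2 * h)

print("second weight:", grad_w2_formula, grad_w2_numeric)
print("first weight: ", grad_w1_formula, grad_w1_numeric)
```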

Tiago · January 22, 2020, 3:01pm

Yes, your grad1 is correct; I made a typo.

For grad2, that is indeed model.layer1[0].weight.grad. I thought I could write the output values directly in the loss function, without the weights, since I would like to use t1 instead of t. But this is not possible, is it?

albanD · January 22, 2020, 3:15pm

I don’t think it is.
Also, why do you compute t1 with -loss?

Tiago · January 22, 2020, 3:20pm

Ah, too bad. I will have to find a workaround then, because I really need the derivative of the loss with t1 present there.

The minus sign is because I wrote the loss as (t - output) instead of (output - t). Later on, the weight update will be w += lr * grad instead of w += -lr * grad.

But thank you anyway for the help.
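
One possible workaround, sketched under the assumption that t1 is meant to act as a fixed target while the hidden-layer loss is differentiated: compute the gradient without create_graph=True and call detach() on the resulting target, so autograd treats it as a constant. The names layer1, layer2, delta, ref and local are illustrative and not taken from the original code, and the fixed seed is only there for reproducibility.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Two bias-free 1x1 linear layers, as in the original network.
layer1 = nn.Linear(1, 1, bias=False)
layer2 = nn.Linear(1, 1, bias=False)

x = torch.randn(1, 1)   # input
t = torch.randn(1, 1)   # desired output

out1 = layer1(x)
out2 = layer2(out1)
loss = 0.5 * torch.sum((t - out2) ** 2)

# Reference: ordinary backpropagation to the first weight.
ref = torch.autograd.grad(loss, layer1.weight, retain_graph=True)[0]

# Negative gradient of the loss with respect to the hidden activation.
delta = torch.autograd.grad(-loss, out1, retain_graph=True)[0]

# Hidden-layer target, detached so it is a constant for autograd
# (assumption: the algorithm intends t1 to be a fixed target).
t1 = (out1 + delta).detach()

loss2 = 0.5 * torch.sum((t1 - out1) ** 2)
local = torch.autograd.grad(loss2, layer1.weight)[0]

print(ref)
print(local)
```

With the target detached, the only path from loss2 to the first weight goes through out1, which is what the hand derivation earlier in the thread assumes.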

albanD · January 22, 2020, 3:28pm

The minus sign is because I wrote the loss as (t - output) instead of (output - t)

But this is squared, so both are exactly the same. You don’t need a - sign here, right?

Tiago · January 22, 2020, 3:31pm

Sorry, I meant the derivative of the loss.
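
To make the sign convention concrete, here is a small scalar sketch with arbitrary assumed values for t, out, w and lr. Both ways of writing the squared loss give the same value and the same derivative with respect to the output. Differentiating -loss yields (t - out), which is why the update can be written with a plus sign.

```python
# Arbitrary scalar values (assumptions for illustration).
t, out = -0.3, 0.525

# The two ways of writing the squared loss agree.
loss_a = 0.5 * (t - out) ** 2
loss_b = 0.5 * (out - t) ** 2

# Derivative of either form with respect to out, checked by central differences.
h = 1e-6
dloss_dout = (0.5 * (t - (out + h)) ** 2 - 0.5 * (t - (out - h)) ** 2) / (2 * h)

# Differentiating -loss instead flips the sign.
neg_grad = -dloss_dout

# Both update rules land on the same value for a generic parameter w.
w, lr = 1.0, 0.1
w_descent = w - lr * dloss_dout      # w += -lr * grad
w_plus = w + lr * neg_grad           # w += lr * grad, with grad taken from -loss

print(loss_a, loss_b)
print(dloss_dout, neg_grad)
print(w_descent, w_plus)
```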