Is it possible to use gradient w.r.t input in loss?

Hi, I need to construct a loss function that uses the gradient of the output of the NN w.r.t the input, i.e. something like:

1 output = NN(input, weight)
2 grad_input = output.backward(retain_variables=True)
3 loss = criterion(grad_input.grad.data, target_of_grad_input)
4 loss.backward()

The problem is that after step 2, I only have the numerical value of the gradient of the output w.r.t. the input (i.e. grad_input). It is no longer a function of the weights, so the loss is not a function of the weights either, and step 4 (where I want the gradient of the loss w.r.t. the weights) will fail.

How should I tackle a problem like this? Thanks for any suggestions.

First, if I understand automatic differentiation correctly, the gradient with respect to the input can be obtained like this:

output = NN(input, weight)
output.backward(retain_variables=True) 
grad_input = input.grad # at least, this is a variable, unfortunately volatile

Now, the problem is that grad_input is a volatile variable disconnected from the graph. So as you said, you have the value, but you don’t have grad_input = f(input) in the graph. So if you do z = grad_input*2 and try z.backward(), you will get an error.
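In current tensor terms the disconnect is easy to see: after backward(), the stored gradient does not require grad, so nothing built from it can be differentiated. A minimal illustration with a toy function (not the NN above):

```python
import torch

# Toy illustration: after backward(), the stored gradient is a plain value,
# detached from the graph, so it cannot be differentiated further.
x = torch.randn(3, requires_grad=True)
y = (x ** 2).sum()
y.backward()

g = x.grad              # numerically 2*x, but not a node in the graph
print(g.requires_grad)  # False: z = g * 2 would not be differentiable
```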

You have to be patient: according to this topic, PyTorch will soon keep the gradient in the graph, and it will then be possible to call backward a second time and get second-order derivatives.
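For what it’s worth, this is exactly what torch.autograd.grad with create_graph=True provides in later versions: the first-derivative computation itself stays in the graph, so a loss on the gradient can be backpropagated to the weights. A minimal sketch with a toy function (not the NN in question):

```python
import torch

# Toy "network": y = sum(w * x^2), so dy/dx = 2 * w * x depends on w.
x = torch.randn(4, requires_grad=True)
w = torch.randn(4, requires_grad=True)
y = (w * x ** 2).sum()

# First derivative dy/dx, kept in the graph so it stays differentiable
(grad_x,) = torch.autograd.grad(y, x, create_graph=True)

target = torch.ones_like(grad_x)
loss = ((grad_x - target) ** 2).mean()  # loss on the gradient itself
loss.backward()                         # fills w.grad = d(loss)/dw
```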

But in your case, you can avoid the second-order derivative with a bit of mathematics :wink:

If you want grad_input = target, that is the same as: output = target*input + c, for any constant c.
Which is the same as: (output - c)/input = target.
So you can minimize your loss with respect to the parameters of your NN, plus an additive bias c:

If output and input have different dimensions, then in fact you want to approximate the target with the outer product of (output - c) and 1/input (with pointwise inversion).

class NN2(NN):
    def __init__(self, output_size):
        super(NN2, self).__init__()
        # c must be a Parameter so the optimizer can update it
        self.c = nn.Parameter(torch.rand(output_size))

    def forward(self, input):
        output = super(NN2, self).forward(input)
        # outer product of (output - c) and the pointwise inverse of input
        input_inv = (1 / input).view(1, -1)
        gram = torch.mm((output - self.c).view(-1, 1), input_inv)
        return gram

model = NN2(output_size)
optimizer = optim.Adam(model.parameters(), lr=lr)
gram = model(input)
loss = criterion(gram, target_of_grad_input)
loss.backward()
optimizer.step()

Thanks Alexis for the reply.

If I understand correctly, the following is not generally true,

If you want grad_input = target, that is the same as: output = target*input + c, for any constant c.
Which is the same as: (output - c)/input = target.

provided that grad_input is the gradient of output w.r.t. the input, i.e.

grad_input = d{output}/d{input}.

Sorry for not making it clear.

Actually, if we backpropagate (output-c)/input, we get

d{(output-c)/input}/d{weight} = [d{output}/d{weight}*input - (output-c)*d{input}/d{weight}] / (input*input).

Apparently, this is not

d(d{output}/d{input})/d{weight}

needed by the optimizer.

I don’t think there is a workaround for the second derivative. So I’ll keep an eye on the topic you mentioned.

To simplify the notation, I’m using input = x, output = y, target = t, weight = w.

If I understand correctly, the following is not generally true,

Of course, this has no sense when at least one element of x is close to 0. But what you can do is change your target: since you want y(w,x) = t*x + c by playing with w and c, you can detach x and fold it into your target:

min criterion[y(w,x) - c, t*x] by playing with w and c
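A sketch of this "move x into the target" variant: compare y(w,x) - c against t*x with x treated as a constant. The model, dimensions, and learning rate below are placeholders, not from the thread:

```python
import torch
import torch.nn as nn
import torch.optim

model = nn.Linear(3, 3)            # stands in for the NN, y = model(x)
c = nn.Parameter(torch.rand(3))    # the additive bias c
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(list(model.parameters()) + [c], lr=1e-3)

x = torch.randn(5, 3)
t = torch.randn(5, 3)              # target for dy/dx

# x.detach(): x enters the target as a constant, so no gradient flows
# through the t*x side of the comparison
loss = criterion(model(x) - c, t * x.detach())
loss.backward()
optimizer.step()
```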

Sorry for not making it clear.
Actually, if we backpropagate (output-c)/input, we get
[…]
needed by the optimizer.

The thing is, what do you want at the equilibrium? If you want to play with w in order to get dy/dx = t, then both methods converge to this equilibrium. With my technique, you play with w and c in order to get y = t*x + c for all (x, t).
You can verify that the two equilibria are equivalent.

And of course, d( dy/dx )/dw is different from d( y )/dw. This is not a problem: we are no longer trying to minimize the same function. The only thing that counts is what we obtain after convergence.

I don’t think there is work-around for the second derivative. So I’ll keep an eye on the topic you mentioned.

Don’t give up! You don’t need a second-order derivative!

The thing is, what do you want at the equilibrium?

You can verify that the two equilibria are equivalent.

@alexis-jacq I get your idea. I’ll think about how to implement it. Thanks!

You can just implement the criterion yourself; inside it, you can compute the gradient w.r.t. the input.
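One way such a custom criterion could look, assuming a PyTorch version where torch.autograd.grad supports create_graph: the loss takes the model and input, computes dy/dx inside, and compares it to a target. The name grad_matching_loss and the toy linear model are illustrative, not from the thread:

```python
import torch
import torch.nn as nn

def grad_matching_loss(model, x, grad_target):
    # Make a leaf copy of x so we can ask for dy/dx
    x = x.clone().requires_grad_(True)
    y = model(x).sum()
    # create_graph=True keeps dy/dx differentiable w.r.t. the weights
    (dy_dx,) = torch.autograd.grad(y, x, create_graph=True)
    return ((dy_dx - grad_target) ** 2).mean()

model = nn.Linear(3, 1)
x = torch.randn(5, 3)
loss = grad_matching_loss(model, x, torch.ones(5, 3))
loss.backward()   # second-order backward through dy/dx fills the weight grads
```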