How to calculate 2nd derivative of a likelihood function

Wei_Deng · March 17, 2018, 8:26pm

I want to calculate the diagonal of the 2nd derivative (the Hessian) of a function, for example a likelihood function, but I couldn’t find any documentation covering that.

Can anyone give me an example?

I really appreciate that.

Thanks a lot.

Wei_Deng · March 17, 2018, 8:34pm

I know there are some basic tutorials like

However, I am afraid they still can’t solve my problem.

Wei_Deng · March 17, 2018, 8:56pm

For example, I followed this post to try to get the 2nd derivative [Second order derivatives and inplace gradient "zeroing" ], but grd.grad turns out to be None. Can anyone give me some suggestions?

import torch
from torch import Tensor
from torch.autograd import Variable
from torch.autograd import grad
from torch import nn

# some toy data
x = Variable(Tensor([4., 2.]), requires_grad=False)
y = Variable(Tensor([1.]), requires_grad=False)

# linear model and squared difference loss
model = nn.Linear(2, 1)
loss = torch.sum((y - model(x))**2)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

# instead of using loss.backward(), use torch.autograd.grad() to compute gradients
loss_grads = grad(loss, model.parameters(), create_graph=True)

gn2 = sum([grd.norm()**2 for grd in loss_grads]) # squared norm of the gradient
print('loss %f grad norm %f' % (loss.item(), gn2.item()))
model.zero_grad()
gn2.backward()
optimizer.step()

for grd in loss_grads:
  print(grd.grad)

The output is None.

dpernes · March 19, 2018, 11:36am

Hi @Wei_Deng,

You are doing everything right, except that you’re not looking for the gradients in the right place.

After a backward pass, only the gradients of the loss with respect to the model parameters (the leaf tensors of the graph) are kept. Thus, if your model has some parameter theta, you’ll find the gradient of the loss w.r.t. theta in theta.grad.

Therefore, if you want to print gradients, change your final loop to something like:

for name, param in model.named_parameters():
  print(name, param.grad)
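
To make the distinction concrete, here is a minimal sketch of the same toy setup written against the current tensor API (dropping the Variable wrapper is an assumption about the PyTorch version in use). The tensors returned by torch.autograd.grad() are intermediate results, so their .grad stays None, while backward() fills .grad on the leaf parameters.

```python
import warnings

import torch
from torch import nn
from torch.autograd import grad

torch.manual_seed(0)

# same toy data and model as in the question
x = torch.tensor([4., 2.])
y = torch.tensor([1.])
model = nn.Linear(2, 1)
loss = torch.sum((y - model(x)) ** 2)

# Gradients returned by grad() are non-leaf tensors that stay in the graph.
loss_grads = grad(loss, model.parameters(), create_graph=True)
gn2 = sum(g.norm() ** 2 for g in loss_grads)

model.zero_grad()
gn2.backward()

with warnings.catch_warnings():
    # Reading .grad on a non-leaf tensor raises a UserWarning; silence it here.
    warnings.simplefilter("ignore")
    non_leaf_grads = [g.grad for g in loss_grads]

print(non_leaf_grads)  # [None, None]
for name, param in model.named_parameters():
    # d(gn2)/d(param), populated by gn2.backward()
    print(name, param.grad)
```

Note that param.grad here holds the gradient of the squared gradient norm, which is a Hessian-vector product, not the diagonal of the Hessian.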

Wei_Deng · March 19, 2018, 5:53pm

Thank you @dpernes for your inspiring comments; that works for that problem.

But this still doesn’t help me find the 2nd derivative of a likelihood function with respect to each weight. Do you know how to achieve that?

Or do you know whether it can be done with the current PyTorch version?

dpernes · March 20, 2018, 10:47am

What do you mean exactly by second derivative? Do you want to find the full Hessian matrix or only the second order derivatives with respect to each parameter individually (i.e. the diagonal of the Hessian matrix)?

Btw, what is your final purpose?

Wei_Deng · March 20, 2018, 6:31pm

@dpernes, my final goal is to get the diagonal of the Hessian matrix to estimate the empirical Fisher information matrix.

dpernes · March 21, 2018, 11:58am

That’s a bit tricky, I think. But it’s doable, of course.

It is tricky because PyTorch only allows you to compute derivatives of scalars with respect to multidimensional Tensors. Thus, you have to iterate through every single scalar parameter in your model (i.e., every entry in every parameter matrix) and compute the derivative of its derivative with respect to itself.

Honestly, I don’t know if there is an easy way to iterate through every entry in an n-D Tensor for arbitrary n (in numpy, there is nditer for this job). If you can find one, then the task is easy.

Let me show a sketch. To stay independent of the parameter shape, it walks each gradient through a flattened view instead of relying on an n-D iterator.

import torch
from torch.autograd import grad
from torch import nn

# some toy data
x = torch.tensor([4., 2.])
y = torch.tensor([1.])

# linear model and squared difference loss
model = nn.Linear(2, 1)
loss = torch.sum((y - model(x))**2)

# instead of using loss.backward(), use torch.autograd.grad() to compute gradients
loss_grads = grad(loss, model.parameters(), create_graph=True)

# compute the second order derivative w.r.t. each parameter
d2loss = []
for param, grd in zip(model.parameters(), loss_grads):
  flat_grd = grd.reshape(-1)
  for i in range(flat_grd.numel()):
    # differentiate one gradient entry w.r.t. the whole parameter (param[idx] would be
    # a new tensor outside the graph), then keep the entry that matches it
    drv = grad(flat_grd[i], param, retain_graph=True)[0].reshape(-1)[i]
    d2loss.append(drv)
    print(i, drv)

Wei_Deng · March 21, 2018, 1:52pm

Thank you so much for helping me find a possible way of doing this. I hope the PyTorch developers will someday release a version that handles this directly.

nima_rafiee · May 21, 2019, 1:41pm

Simply call backward() twice and you get the diagonal of the Hessian matrix.

x = torch.ones(2, requires_grad=True, )
y = torch.pow(x,3)
out = torch.mean(y)
print(y)
print(out)
out.backward( retain_graph=True, create_graph=True)
print(x.grad)
print(out)
out.backward()
x.grad

albanD · September 27, 2019, 3:51pm

Hi,

No, this won’t give you the second derivative: the second backward() call just accumulates the first-order gradient into x.grad again.
See the answer from @dpernes above for the right way to do this.
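
A small check of this point, as a sketch: the input values [2., 3.] and the helper f are chosen here for illustration (with the all-ones input from the post above, the two results happen to coincide). Calling backward() twice doubles the first derivative stored in x.grad, whereas differentiating the gradient itself gives the second derivative.

```python
import torch

def f(x):
    # same function as in the post above: mean of x^3
    return torch.mean(x ** 3)

# Two backward() calls: the second adds the first derivative to x.grad again.
x = torch.tensor([2., 3.], requires_grad=True)
out = f(x)
out.backward(retain_graph=True, create_graph=True)
out.backward()
accumulated = x.grad.detach().clone()      # 2 * (1.5 * x^2)

# Differentiating the gradient itself gives second derivatives.
x2 = torch.tensor([2., 3.], requires_grad=True)
(g,) = torch.autograd.grad(f(x2), x2, create_graph=True)   # 1.5 * x^2
# g.sum() differentiates to the Hessian row sums; they equal the diagonal
# here only because f has no cross terms between the entries of x.
(hess_diag,) = torch.autograd.grad(g.sum(), x2)            # 3 * x

print(accumulated)   # tensor([12., 27.])
print(hess_diag)     # tensor([6., 9.])
```

For a general function with cross terms, the row-sum shortcut no longer gives the diagonal, and each gradient entry has to be differentiated separately as in the loop-based answer.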

Peter_Ham · October 2, 2019, 4:44am

I want to do something similar, as in meta-learning. You have a function y = f_{\theta}(x, t), and you update the variable t so that f_{\theta}(x, t) approximates the target y_target. The mathematics is: t’ = t - \nabla_t |f_{\theta}(x, t) - y_target|^2, and then you update \theta using \theta = \theta - \nabla_{\theta} |f_{\theta}(x, t’) - y_target|^2. Here you first update t, and the updated t’ must still carry its gradient history when you then update \theta. I don’t think the proposed solution can do this. In TensorFlow this is very easy, and you can update t multiple times, which means you can take not only second derivatives but also third and higher ones. Is PyTorch able to achieve this?

albanD · October 2, 2019, 2:16pm

Hi,

Yes, you can take derivatives of any order by calling backward() or autograd.grad() on the output of such a function, as long as the earlier derivative was computed with create_graph=True.

That being said, in your example, I don’t see any second derivative. You update t, then evaluate f with this new t, then update theta right?

Peter_Ham · October 3, 2019, 7:36pm

You update t, but t’ carries the gradient, and then you evaluate f with the new t. The point is that t’ should bring along the computation you did previously. I think autograd.grad() is not a good approach for this, since you have to call it manually for each variable involved, which is tedious.

albanD · October 3, 2019, 7:45pm

Ok, then you can just do:

# This is pseudo code (xxx, f and target are placeholders); you can have a list of params from `model.parameters()` for example
# autograd.grad returns a tuple, so it is unpacked below; with several params, loop over the gradients
t = torch.rand(xxx, requires_grad=True)
theta = torch.rand(xxx, requires_grad=True)

out = f(t, theta)
loss = F.mse_loss(out, target)

gradt, = autograd.grad(loss, t, retain_graph=True, create_graph=True)

new_t = t - gradt

new_out = f(new_t, theta)
new_loss = F.mse_loss(new_out, target)

gradtheta, = autograd.grad(new_loss, theta)

new_theta = theta - gradtheta
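
A runnable variant of this sketch might look as follows. The function f, the tensor sizes, the zero target and the step size lr are all hypothetical stand-ins chosen for illustration; none of them come from the thread. The last part compares the gradient that flows through the inner update with a first-order version that treats new_t as a constant, to show that the second-order term is really there.

```python
import torch
import torch.nn.functional as F
from torch import autograd

torch.manual_seed(0)

# Hypothetical stand-in for f(t, theta); any differentiable function works.
def f(t, theta):
    return torch.tanh(t * theta)

t = torch.rand(3, requires_grad=True)
theta = torch.rand(3, requires_grad=True)
target = torch.zeros(3)
lr = 0.1  # illustrative step size

# Inner step on t; create_graph=True keeps the update differentiable.
loss = F.mse_loss(f(t, theta), target)
(grad_t,) = autograd.grad(loss, t, create_graph=True)
new_t = t - lr * grad_t            # new_t depends on theta through grad_t

# Outer step on theta, differentiating through the inner update.
new_loss = F.mse_loss(f(new_t, theta), target)
(grad_theta,) = autograd.grad(new_loss, theta)

# For comparison: treat new_t as a constant (first-order approximation).
fo_loss = F.mse_loss(f(new_t.detach(), theta), target)
(grad_theta_fo,) = autograd.grad(fo_loss, theta)

with torch.no_grad():
    new_theta = theta - lr * grad_theta

print(grad_theta)      # includes the second-order term
print(grad_theta_fo)   # omits it
```

With several parameters, autograd.grad() accepts a list of tensors and returns one gradient per entry, so the per-variable calls reduce to a single call plus a loop over the results.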

D-X-Y · January 4, 2020, 7:43am

Hi alban, do you know an efficient way to compute the second-order derivative of f(x;w) without the for-loop style?

albanD · January 4, 2020, 11:51am

Hi,

There is no for-loop here. What do you mean?

D-X-Y · January 4, 2020, 12:47pm

Hi albanD, I want to compute the Hessian matrix, and I wrote the following code:

class Net(torch.nn.Module):
  def __init__(self, iS):
    super(Net, self).__init__()
    self.layer = torch.nn.Linear(iS, 1)
  def forward(self, inputs):
    outputs = self.layer(inputs)
    return outputs.mean()
net = Net(10)
inputs = torch.rand(256, 10)
loss = net(inputs)
first_order_grads = torch.autograd.grad(loss, net.parameters(), retain_graph=True, create_graph=True)
# strange .... first_order_grads[0].requires_grad = False, should it be True?
second_order_grads = torch.autograd.grad(first_order_grads, net.parameters())
# the above line does not work

Do you have any suggestions? Thanks!

albanD · January 4, 2020, 7:23pm

If requires_grad is False, it is because the gradient is independent of the parameters: your output is linear in the weights, so its gradient is a constant. The second-order derivative is therefore 0.
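
This can be checked directly. The sketch below uses a bare nn.Linear layer with the same sizes as the Net above (the variable names are illustrative) and compares a loss that is linear in the parameters with one that is not.

```python
import torch

torch.manual_seed(0)
inputs = torch.rand(256, 10)
layer = torch.nn.Linear(10, 1)

# Linear in the parameters: d(loss)/d(weight) is just the mean input, a constant.
linear_loss = layer(inputs).mean()
lin_grads = torch.autograd.grad(linear_loss, layer.parameters(), create_graph=True)
print([g.requires_grad for g in lin_grads])   # [False, False]

# Nonlinear in the parameters: the gradient still depends on weight and bias.
exp_loss = torch.exp(layer(inputs)).mean()
exp_grads = torch.autograd.grad(exp_loss, layer.parameters(), create_graph=True)
print([g.requires_grad for g in exp_grads])   # [True, True]
```

Only in the second case is there a graph left to differentiate a second time.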

D-X-Y · January 5, 2020, 12:57am

I see, thanks. I have revised the code as follows:

def test_auto_grad():
  class Net(torch.nn.Module):
    def __init__(self, iS):
      super(Net, self).__init__()
      self.layer = torch.nn.Linear(iS, 1)
    def forward(self, inputs):
      outputs = self.layer(inputs)
      outputs = torch.exp(outputs)
      return outputs.mean()
  net = Net(10)
  inputs = torch.rand(256, 10)
  loss = net(inputs)
  first_order_grads = torch.autograd.grad(loss, net.parameters(), retain_graph=True, create_graph=True)
  first_order_grads = torch.cat([x.view(-1) for x in first_order_grads])
  second_order_grads = []
  for grads in first_order_grads:
    # retain_graph=True keeps the graph alive for the next loop iteration
    s_grads = torch.autograd.grad(grads, net.parameters(), retain_graph=True)
    second_order_grads.append( s_grads )

As you can see, to obtain the Hessian matrix I have to loop over every entry of first_order_grads. Is there a more efficient way to calculate second_order_grads?
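
One option in newer PyTorch releases is torch.autograd.functional.hessian, which takes a function of the parameters and returns the full Hessian without a hand-written loop. The sketch below is only an illustration under stated assumptions: it packs the 10 weights and the bias into a single flat vector theta (a hypothetical layout, not the Net class above) and compares the result with the closed-form Hessian of this particular loss.

```python
import torch
from torch.autograd.functional import hessian

torch.manual_seed(0)
inputs = torch.rand(256, 10)

# Hypothetical flat parameter vector: 10 weights followed by 1 bias.
theta = torch.randn(11) * 0.1

def loss_fn(theta):
    w, b = theta[:10], theta[10]
    return torch.exp(inputs @ w + b).mean()

H = hessian(loss_fn, theta)      # full Hessian, shape (11, 11)
diag = H.diagonal()              # second derivative w.r.t. each parameter

# Closed form for this loss: mean over samples of exp(z) * [x, 1] [x, 1]^T.
xa = torch.cat([inputs, torch.ones(256, 1)], dim=1)
z = xa @ theta
H_ref = (xa * torch.exp(z).unsqueeze(1)).T @ xa / 256

print(H.shape)
print(torch.allclose(H, H_ref, atol=1e-5))
```

This still materializes the whole matrix, so it suits small models; for large ones, Hessian-vector products computed with torch.autograd.grad() are the usual alternative.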