Hi, I have been stuck on a problem, and I’m wondering if there is a solution. Thank you in advance!
I have two embedding vectors, u and v, both of size m, that are inputs to a torch model, and I would like to efficiently compute a form of the model's summed Hessian between the two vectors. I found that the simple approach is efficient enough for me:
import torch
from torch import autograd

x = torch.stack([u, v]).view(1, 2, m)  # batch size of 1, the 2 vectors u and v, each of size m
y = model(x)                           # y is a scalar
grads = autograd.grad(y, x, create_graph=True)[0].squeeze().sum(1)       # sum within each emb vector (not between them)
grad_u = grads[0]                      # a scalar: sum_i dy/du_i
grads = autograd.grad(grad_u, x, retain_graph=True)[0].squeeze().sum(1)  # sum within each emb vector
grad_uv = grads[1]                     # a scalar: sum_{i,j} d^2 y/(du_i dv_j)
Note the .sum(1) in each autograd.grad line.
Under the hood, autograd computes grad_uv to be the sum of mixed second derivatives: if the sizes of u and v are just m=2, that is (d^2 y)/(du_1 dv_1) + (d^2 y)/(du_1 dv_2) + (d^2 y)/(du_2 dv_1) + (d^2 y)/(du_2 dv_2).
However, what I would really like is the sum of squares of those same terms: ((d^2 y)/(du_1 dv_1))**2 + ((d^2 y)/(du_1 dv_2))**2 + ((d^2 y)/(du_2 dv_1))**2 + ((d^2 y)/(du_2 dv_2))**2
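For concreteness, here is the explicit loop-based version of what I want (just a sketch, reusing model, x, and m from the code above); summing the same entries without the squares reproduces grad_uv:

grad_x = autograd.grad(model(x), x, create_graph=True)[0].squeeze()  # shape (2, m): rows are dy/du and dy/dv
rows = []
for i in range(m):
    # d^2 y/(du_i dv_j) for all j: differentiate the i-th entry of dy/du, keep the v part
    rows.append(autograd.grad(grad_x[0, i], x, retain_graph=True)[0].squeeze()[1])
mixed = torch.stack(rows)          # the (m, m) block d^2 y/(du_i dv_j)
sum_mixed = mixed.sum()            # equals grad_uv from the snippet above
sum_sq_mixed = (mixed ** 2).sum()  # the quantity I actually want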
It seems that the chain rule can't give me this. Is it possible to compute this expression without explicitly running the second autograd.grad line over each element of the vector dy/du? I basically want to avoid the extra for loop (as in the sketch above) that is common in higher-order grad computations. I was thinking that the vector-Jacobian product within grad() could help, but would I need to modify the autograd source code?
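To illustrate the VJP idea (again just a sketch, reusing the names above, with w an arbitrary weight vector I made up): a single grad() call with grad_outputs only gives one weighted combination of the rows, so I don't see how to get the individual entries, and hence their squares, out of it:

w = torch.randn(m)                                                    # arbitrary weights for the VJP
grad_x = autograd.grad(model(x), x, create_graph=True)[0].squeeze()  # shape (2, m)
vjp = autograd.grad(grad_x[0], x, grad_outputs=w)[0].squeeze()[1]    # (m,): for each j, sum_i w_i * d^2 y/(du_i dv_j)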
Thank you so much!