Hi, I have been stuck on a problem, and I’m wondering if there is a solution. Thank you in advance!
I have two embedding vectors, `u` and `v`, both of size `m`, that are inputs to a torch model, and I would like to efficiently compute a form of the model's summed Hessian between the two vectors. I found that the simple approach is efficient enough for me:
```python
import torch
from torch import autograd

x = torch.stack([u, v]).view(1, 2, m)  # batch size of 1, 2 vectors u and v, each of size m
y = model(x)                           # y is a scalar

grads = autograd.grad(y, x, create_graph=True)[0].squeeze().sum(1)  # sum across emb vectors (not between them)
grad_u = grads[0]                                                   # a scalar

grads = autograd.grad(grad_u, x, retain_graph=True)[0].squeeze().sum(1)  # sum across emb vectors
grad_uv = grads[1]                                                       # a scalar
```
Note the `.sum(1)` in each `autograd.grad` line, which sums over the elements of each embedding vector (not between them).
Under the hood, if the sizes of `u` and `v` are just 2, autograd computes `grad_uv` to be the sum of derivatives:

(d^2 y)/(du_1dv_1) + (d^2 y)/(du_1dv_2) + (d^2 y)/(du_2dv_1) + (d^2 y)/(du_2dv_2)
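For concreteness, here is a minimal self-contained sketch of what I mean (the `nn.Sequential` model, sizes, and random inputs below are just stand-ins for my real setup); it also cross-checks `grad_uv` against the u-v block of the full Hessian from `torch.autograd.functional.hessian`:

```python
import torch
from torch import nn, autograd

torch.manual_seed(0)
m = 2
u, v = torch.randn(m), torch.randn(m)

# stand-in scalar-output model, only for illustration
model = nn.Sequential(nn.Flatten(), nn.Linear(2 * m, 8), nn.Tanh(), nn.Linear(8, 1))

x = torch.stack([u, v]).view(1, 2, m).requires_grad_(True)
y = model(x).squeeze()  # scalar

grads = autograd.grad(y, x, create_graph=True)[0].squeeze().sum(1)
grad_u = grads[0]
grad_uv = autograd.grad(grad_u, x, retain_graph=True)[0].squeeze().sum(1)[1]

# full Hessian of y w.r.t. x, with the u-v cross block sliced out
H = torch.autograd.functional.hessian(lambda t: model(t).squeeze(), x)
cross_block = H.reshape(2 * m, 2 * m)[:m, m:]      # entries d^2 y / (du_i dv_j)
print(torch.allclose(grad_uv, cross_block.sum()))  # expected: True
```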
However, what I would really like is this expression with the sum of squares:
((d^2 y)/(du_1dv_1))**2 + ((d^2 y)/(du_1dv_2))**2 + ((d^2 y)/(du_2dv_1))**2 + ((d^2 y)/(du_2dv_2))**2
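In terms of the sketch above, the quantity I want is the sum of squared entries of `cross_block` (its squared Frobenius norm), rather than their plain sum:

```python
# what I actually want, written against the toy sketch above
target = (cross_block ** 2).sum()
```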
It seems that the chain rule can't give me this. Is it possible to compute this expression without explicitly running the second `autograd.grad` line over each element of the vector `dy/du`? I basically want to avoid the extra for loop that is common in higher-order gradient computations. I was thinking that the vector-Jacobian product inside `grad()` could help, but would I need to modify the autograd source code?
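For reference, the explicit per-element version I am trying to avoid would look roughly like this (again written against the toy sketch above):

```python
# loop over the elements of dy/du that I would like to avoid
first = autograd.grad(y, x, create_graph=True)[0].squeeze()  # shape (2, m)
dy_du = first[0]                                             # the vector dy/du_i

sq_sum = x.new_zeros(())
for i in range(m):
    # row i of the u-v Hessian block: d^2 y / (du_i dv_j) for all j
    row = autograd.grad(dy_du[i], x, retain_graph=True)[0].squeeze()[1]
    sq_sum = sq_sum + (row ** 2).sum()
# sq_sum now equals the sum of squared mixed partials
```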
Thank you so much!