Hi, I have been stuck on a problem, and I’m wondering if there is a solution. Thank you in advance!
I have two embedding vectors, u and v, both of size m, that are inputs to a torch model, and I would like to efficiently compute a form of the model's summed Hessian between the two vectors. I found that the simple approach is efficient enough for me:
import torch
from torch import autograd

x = torch.stack([u, v]).view(1, 2, m)  # batch size of 1, the 2 vectors u and v, each of size m
y = model(x)                           # y is a scalar
grads = autograd.grad(y, x, create_graph=True)[0].squeeze().sum(1)       # sum within each emb vector (not between them)
grad_u = grads[0]                      # a scalar: sum_i dy/du_i
grads = autograd.grad(grad_u, x, retain_graph=True)[0].squeeze().sum(1)  # sum within each emb vector
grad_uv = grads[1]                     # a scalar: sum_{i,j} d^2 y/(du_i dv_j)
Note the .sum(1) in each autograd.grad line.
Under the hood, autograd computes grad_uv to be the sum of mixed second derivatives: if the sizes of u and v are just m=2, that is (d^2 y)/(du_1 dv_1) + (d^2 y)/(du_1 dv_2) + (d^2 y)/(du_2 dv_1) + (d^2 y)/(du_2 dv_2).
However, what I would really like is the sum of squares of those same terms: ((d^2 y)/(du_1 dv_1))**2 + ((d^2 y)/(du_1 dv_2))**2 + ((d^2 y)/(du_2 dv_1))**2 + ((d^2 y)/(du_2 dv_2))**2
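For concreteness, here is the explicit loop-based version of what I want (just a sketch, reusing model, x, and m from the code above); summing the same entries without the squares reproduces grad_uv:

grad_x = autograd.grad(model(x), x, create_graph=True)[0].squeeze()  # shape (2, m): rows are dy/du and dy/dv
rows = []
for i in range(m):
    # d^2 y/(du_i dv_j) for all j: differentiate the i-th entry of dy/du, keep the v part
    rows.append(autograd.grad(grad_x[0, i], x, retain_graph=True)[0].squeeze()[1])
mixed = torch.stack(rows)          # the (m, m) block d^2 y/(du_i dv_j)
sum_mixed = mixed.sum()            # equals grad_uv from the snippet above
sum_sq_mixed = (mixed ** 2).sum()  # the quantity I actually want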
It seems that the chain rule can't give me this. Is it possible to compute this expression without explicitly running the second autograd.grad line over each element of the vector dy/du? I basically want to avoid the extra for loop (as in the sketch above) that is common in higher-order grad computations. I was thinking that the vector-Jacobian product within grad() could help, but would I need to modify the autograd source code?
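To illustrate the VJP idea (again just a sketch, reusing the names above, with w an arbitrary weight vector I made up): a single grad() call with grad_outputs only gives one weighted combination of the rows, so I don't see how to get the individual entries, and hence their squares, out of it:

w = torch.randn(m)                                                    # arbitrary weights for the VJP
grad_x = autograd.grad(model(x), x, create_graph=True)[0].squeeze()  # shape (2, m)
vjp = autograd.grad(grad_x[0], x, grad_outputs=w)[0].squeeze()[1]    # (m,): for each j, sum_i w_i * d^2 y/(du_i dv_j)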
Thank you so much!