Hessian vector product optimization

Here is a piece of code that computes a Hessian-vector product (the gradient of the gradient, contracted with a given vector).

import torch

input = torch.tensor([1.0, 2.0, 0.5, 0.2], requires_grad=True)
output = input.tanh().sum()
grads = torch.autograd.grad(output, input, create_graph=True, retain_graph=True)
flatten = torch.cat([g.reshape(-1) for g in grads if g is not None])
for i in range(100):
    v = torch.randn(4)
    hvps = torch.autograd.grad([flatten @ v], input, allow_unused=True, retain_graph=True)
    print("{} {} {}".format(output.data, grads[0].data, hvps[0].data))

PyTorch says the retain_graph=True in the hvps computation is necessary. Otherwise, this error message shows up:

RuntimeError: Trying to backward through the graph a second time, but the buffers have already been freed. Specify retain_graph=True when calling backward the first time.

However, I am wondering if that retain_graph=True is REALLY necessary. I might be wrong, but it seems to me that computing the Hessian-vector product for v1 doesn't depend on the one for v2. Won't this incur unnecessary memory overhead? Could this code snippet be written differently, so that graphs that are no longer needed are not kept around?
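One way to sidestep retain_graph entirely, at the cost of redoing the forward pass for every vector, is to let torch.autograd.functional.hvp rebuild the graph per call. A sketch using the same function as my snippet (trading compute for memory):

```python
import torch

def f(x):
    # same scalar function as in the snippet above
    return x.tanh().sum()

x = torch.tensor([1.0, 2.0, 0.5, 0.2])

for _ in range(3):
    v = torch.randn(4)
    # hvp rebuilds the forward and backward graphs internally on each call,
    # so nothing has to be retained between iterations
    out, hv = torch.autograd.functional.hvp(f, x, v)
    print(out, hv)
```

The peak memory per iteration is then just one graph's worth, instead of keeping g's graph alive across the whole loop.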

In the backward pass, all the intermediate results are freed by default to reduce the memory footprint. To backprop through the same graph again, you would have to rebuild it; with retain_graph=True the intermediates are kept instead of being deleted.
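A minimal repro of what happens once those intermediates are freed (toy function and values, not your code):

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = (x ** 3).sum()
# create_graph=True builds a graph for g itself, so we can differentiate g
g, = torch.autograd.grad(y, x, create_graph=True)

# the first backward through g's graph works, but frees its buffers...
h1, = torch.autograd.grad(g.sum(), x)

# ...so a second backward through the same graph raises RuntimeError
try:
    h2, = torch.autograd.grad(g.sum(), x)
except RuntimeError as e:
    print("second backward failed:", e)
```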

Hi Kushaj,

Thanks for the prompt reply.

However, in my case, I think the memory footprint can be further optimized.

Basically, I have a function f, for which I compute the gradient g. Then, in the for loop, I compute the gradient of g @ v for different vectors v.

I believe that each iteration of the for loop depends on the computation of g, but the iterations don't depend on each other. However, PyTorch asks me to set retain_graph=True for all of them, which I think should not be necessary.
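To make the structure concrete, here is a minimal sketch (toy values): each dot product g @ v is only a tiny new node attached on top of the one shared graph for g, so every backward pass revisits the same upstream buffers:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = x.tanh().sum()
g, = torch.autograd.grad(y, x, create_graph=True)

v1, v2 = torch.randn(2), torch.randn(2)
s1 = g @ v1  # new dot-product node on top of g's graph
s2 = g @ v2  # another new node on top of the SAME graph

# both scalars point back to the same upstream grad_fn for g
print(s1.grad_fn.next_functions[0][0] is s2.grad_fn.next_functions[0][0])
```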

Do you see the special situation I am having here?


Use this as a reference, as it goes into more detail: link