criterion = nn.CrossEntropyLoss()
inputs, labels = inputs.cuda(), labels.cuda()  # Variable wrappers are no longer needed
model.zero_grad()
outputs = model(inputs)
loss = criterion(outputs, labels)
# create_graph=True so the gradients can be differentiated a second time
grads = torch.autograd.grad(loss, model.parameters(), create_graph=True)
elem_prod = [grad * vi for grad, vi in zip(grads, v)]
# Is grad_outputs correct here?
Hvp = torch.autograd.grad(elem_prod, model.parameters(),
                          grad_outputs=[torch.ones_like(g) for g in grads],
                          create_graph=True)

The second grad() call here takes a non-scalar input. Is the above implementation the right way to do it?

If it works, it is the same as taking the gradient of the sum of all components of elem_prod, which is exactly what you want for the HVP. If it throws an error, try summing over the components and over the list yourself, so that you pass a scalar to the second grad() call.
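A minimal sketch of that explicit-sum variant, using a tiny CPU model purely for illustration (the model, shapes, and vector v here are hypothetical, not from the original post):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 2)          # toy model standing in for the real one
inputs = torch.randn(4, 3)
labels = torch.randint(0, 2, (4,))

criterion = nn.CrossEntropyLoss()
loss = criterion(model(inputs), labels)

params = list(model.parameters())
# create_graph=True so the gradients themselves can be differentiated
grads = torch.autograd.grad(loss, params, create_graph=True)

# A vector v with the same shapes as the parameters
v = [torch.randn_like(p) for p in params]

# Summing all components of grad * v gives a scalar, so no
# grad_outputs argument is needed in the second call.
grad_v = sum((g * vi).sum() for g, vi in zip(grads, v))
hvp = torch.autograd.grad(grad_v, params)
print([h.shape for h in hvp])
```

Each tensor in hvp has the same shape as the corresponding parameter, as expected for a Hessian-vector product.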

I’m using a similar approach to calculate this, but I’m getting OOM after a few batches. Suppose you feed your data batch by batch to this code and get the HVP for each batch separately. Any idea how to avoid the OOM?
I’m fairly sure the cause of the OOM is the “create_graph=True” option in the first grad() call (which is of course necessary for the second grad() call), though I’d argue that you don’t need create_graph=True in the second call.
Basically, I’d like to free the computation graph after each HVP calculation.
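One way to sketch this, assuming the HVP itself does not need to be differentiated further: drop create_graph from the second grad() call (it defaults to False, so the double-backward graph is freed after that call) and detach the results so nothing keeps the graph alive across batches. The model, shapes, and helper name below are hypothetical:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Linear(3, 2)          # toy stand-in for the real model
criterion = nn.CrossEntropyLoss()
params = list(model.parameters())
v = [torch.randn_like(p) for p in params]

def batch_hvp(inputs, labels):
    loss = criterion(model(inputs), labels)
    # create_graph=True is still required here so the second grad() works
    grads = torch.autograd.grad(loss, params, create_graph=True)
    grad_v = sum((g * vi).sum() for g, vi in zip(grads, v))
    # No create_graph here: the graph is freed once this call returns
    hvp = torch.autograd.grad(grad_v, params)
    # Detach so no reference to this batch's graph survives the loop
    return [h.detach() for h in hvp]

for _ in range(3):  # feed batches; memory usage should stay flat
    hvp = batch_hvp(torch.randn(4, 3), torch.randint(0, 2, (4,)))
```

With create_graph=True in the second call, every batch's double-backward graph would be retained by the returned tensors, which matches the OOM-after-a-few-batches symptom.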