Applying an input-dependent transformation to the gradients during the backward pass

Hi everyone,

I have a specific use case for autograd and I would like your opinion on an efficient way to make it work.

For context, I want to apply a transformation to the gradients before running the optimizer step. The transformation multiplies the gradient of the loss by the gradient of the prediction function.

For a parameter p, a prediction function (neural network) f, and a loss L, the transformation is the following:

grad_p = d(f_p)/d(p) * d(L_p)/d(p)

The problem is that both gradients are input-dependent, which means that the transformation must be applied before the gradient accumulation step during the backward pass.

The naive solution I can think of is to use a batch_size of 1; in that case I can proceed as follows:

# batch input X, batch ground truth y (batch_size = 1, scalar prediction)
prediction = model(X)
prediction.backward(retain_graph=True)  # first backward pass: d(prediction)/d(p)
gradients = []
for param in model.parameters():
    gradients.append(param.grad.clone())  # clone, otherwise zero_grad() wipes them
model.zero_grad()
loss = loss_fn(prediction, y)  # MSE
loss.backward()  # second backward pass: d(loss)/d(p)
for i, param in enumerate(model.parameters()):
    param.grad *= gradients[i]  # transformed gradient, ready for the optimizer step

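For reference, the same per-sample computation can also be written with torch.autograd.grad, which avoids clearing .grad between the two passes. This is just a sketch of the same batch_size-1 idea (assuming the prediction is a scalar, like above), not something I have benchmarked:

import torch

params = list(model.parameters())
prediction = model(X)
# gradients of the prediction w.r.t. every parameter; keep the graph for the loss backward
pred_grads = torch.autograd.grad(prediction, params, retain_graph=True)
loss = loss_fn(prediction, y)  # MSE
loss.backward()  # populates param.grad with d(loss)/d(p)
for g, param in zip(pred_grads, params):
    param.grad *= g  # same transformation as above
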
Now this is very compute-inefficient, so my question is: is there a way to do this batch-wise, applying the transformation on the fly during the backward pass?

I have thought about registering a backward hook, but it is not clear to me how to pass the first set of gradients to the hook.

Thank you.

That is an interesting question, and thanks for the clear code example.

Using a backward hook as you suggest is how I would approach it. I’ve not tested it, but have you tried registering a hook on each parameter, something like:

for param in model.parameters():
    param.register_hook(lambda param_grad: param_grad * gradients.pop(0))

right before loss.backward()? The idea is to consume your first list of gradients in the same order it was built (this relies on the hooks being called in the same order the gradients were collected).
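
A small caveat: the order in which the hooks fire during the backward pass is not guaranteed to match the order you built the list in, so a safer (still untested) variant is to register one hook per parameter and bind the matching gradient explicitly via a default argument. A rough sketch, reusing model, loss_fn, X and y from your example and assuming a scalar prediction:

prediction = model(X)
prediction.backward(retain_graph=True)  # first pass: d(prediction)/d(p)
pred_grads = [p.grad.clone() for p in model.parameters()]
model.zero_grad()

# one hook per parameter, each bound to its own prediction gradient,
# so the result does not depend on the order the hooks are called in
handles = []
for param, g in zip(model.parameters(), pred_grads):
    handles.append(param.register_hook(lambda grad, g=g: grad * g))

loss = loss_fn(prediction, y)
loss.backward()  # each hook rescales its parameter's gradient before it lands in .grad
# the optimizer step would go here

for h in handles:
    h.remove()  # remove the hooks so they don't stack up across iterations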