I have a neural network $F^{\theta}(x)$ parameterized by $\theta$, and I am trying to update the network parameters using a loss whose gradient with respect to $\theta$ is
$$
\left. \frac{ \nabla_{\theta} F^{\theta}(x) }{ \nabla_{x} F^{\theta}(x) } \right|_{x = \text{inputs}},
$$
i.e. the gradient of $F$ with respect to the network parameters, divided by the gradient of $F$ with respect to its inputs. I have started implementing three different approaches, which I believe should all lead to the same result:
```python
import torch

model.optimizer.zero_grad()
x = inputs.clone().requires_grad_()
y1 = model(x)
# ... do one of the 3 approaches below to define `loss` ...
loss.backward()  # assumes `loss` has been reduced to a scalar, e.g. via .mean()
model.optimizer.step()

##### approach 1
# differentiate y1 w.r.t. the inputs and detach the result, so the
# denominator is treated as a constant during the parameter update
grad_y_inputs = torch.autograd.grad(outputs=y1.sum(),
                                    inputs=x)[0].detach()
y2 = model(inputs)  # fresh forward pass to build a new graph
loss = y2 / grad_y_inputs
#####

##### approach 2
# create_graph=True keeps x.grad attached to the autograd graph,
# so the denominator is differentiated through as well
y1.sum().backward(create_graph=True)
model.optimizer.zero_grad()  # discard the parameter grads left by this backward
loss = y1 / x.grad
#####

##### approach 3
y1.sum().backward()  # x.grad carries no graph here
model.optimizer.zero_grad()  # discard the parameter grads left by this backward
y2 = model(inputs)  # the graph of y1 was freed, so run a new forward pass
loss = y2 / x.grad
#####
```
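
For reference, here is a minimal self-contained sketch of how I compare the parameter gradients that the three approaches produce on a toy model (the architecture, sizes, seed, and the `.mean()` reduction are arbitrary placeholders, not my real setup):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
inputs = torch.randn(3, 1)  # scalar input/output, so y / x.grad is a plain elementwise ratio

def make_model():
    # reseed so every approach starts from identical weights
    torch.manual_seed(0)
    return nn.Sequential(nn.Linear(1, 8), nn.Tanh(), nn.Linear(8, 1))

def theta_grads_approach_1():
    model = make_model()
    x = inputs.clone().requires_grad_()
    y1 = model(x)
    g = torch.autograd.grad(outputs=y1.sum(), inputs=x)[0].detach()
    loss = (model(inputs) / g).mean()
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

def theta_grads_approach_2():
    model = make_model()
    x = inputs.clone().requires_grad_()
    y1 = model(x)
    y1.sum().backward(create_graph=True)  # x.grad stays attached to the graph
    model.zero_grad()  # drop the parameter grads from the first backward
    loss = (y1 / x.grad).mean()
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

def theta_grads_approach_3():
    model = make_model()
    x = inputs.clone().requires_grad_()
    y1 = model(x)
    y1.sum().backward()  # graph freed; x.grad carries no graph
    model.zero_grad()  # drop the parameter grads from the first backward
    loss = (model(inputs) / x.grad).mean()
    loss.backward()
    return [p.grad.clone() for p in model.parameters()]

for g1, g2, g3 in zip(theta_grads_approach_1(),
                      theta_grads_approach_2(),
                      theta_grads_approach_3()):
    print(torch.allclose(g1, g2), torch.allclose(g1, g3))
```

Rebuilding the model with the same seed in each helper ensures the comparison starts from identical weights, so any difference in the printed results comes from the approaches themselves.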
- First of all, do the three approaches above perform the desired gradient computation?
- Is there a preferred approach, e.g. for computational or semantic reasons?