I am trying to implement TRPO, and I need the gradient of the KL divergence between the current action distribution and the action distribution after the parameters are changed, taken with respect to the network parameters.
Put simply, we have:
action_distribution0 <-- Network(state)
We need to calculate something to the effect of:
autograd.backward(KL(Network(state), action_distribution0), [torch.ones(1)])
What is the best way to implement this? How do I make sure action_distribution0 doesn’t get backpropagated through?
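To make the intent concrete, this is roughly the computation I have in mind (a sketch using torch.distributions and a categorical policy; the network, shapes, and names are only illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
from torch.distributions.kl import kl_divergence

# illustrative policy network and batch of states
policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
state = torch.randn(8, 4)

action_distribution0 = Categorical(logits=policy(state))  # current policy
# ... parameters get changed here ...
new_distribution = Categorical(logits=policy(state))      # policy after the change

kl = kl_divergence(action_distribution0, new_distribution).mean()
kl.backward()  # but this also backpropagates through action_distribution0
```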
It sounds very similar to Q-learning: just use .detach() to prevent the gradients from being propagated into the target:
action_distribution0 = Network(state).detach()
# detach() blocks the gradient
# action_distribution0 no longer requires grad, so it can be used as a fixed loss target
# =========================================
# change the parameters of the network here
# =========================================
KL(Network(state), action_distribution0).backward()
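A minimal runnable sketch of the above, assuming a categorical policy head and torch.distributions (the network, the parameter perturbation, and all shapes are illustrative):

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical
from torch.distributions.kl import kl_divergence

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
state = torch.randn(8, 4)

# "old" distribution: detach the logits so it acts as a constant target
old_dist = Categorical(logits=policy(state).detach())

# change the parameters of the network here (illustrative perturbation)
with torch.no_grad():
    for p in policy.parameters():
        p.add_(0.01 * torch.randn_like(p))

# "new" distribution still tracks gradients w.r.t. the parameters
new_dist = Categorical(logits=policy(state))

kl = kl_divergence(old_dist, new_dist).mean()
kl.backward()  # gradients flow only through new_dist; old_dist is a fixed target
```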
In the dcgan example, the G(z) output is copied over before pushing through D (link).
Is this done in order to avoid backpropagation through G?
Would this be an equivalent way of doing things?
fake = netG(noise)
fake = fake.detach() # detach() is not an in-place op, so I need to re-assign?
output = netD(fake)
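Concretely, I'm asking whether re-assigning the detached tensor is equivalent to the inline detach (netG/netD as in the example, loss details omitted):

```python
fake = netG(noise)

# pattern A: detach inline when feeding the discriminator
output = netD(fake.detach())

# pattern B: detach() returns a new tensor, so bind the result to a name;
# gradients from the discriminator loss stop at the detached tensor
fake_detached = fake.detach()
output = netD(fake_detached)
```

My understanding is that either way the graph built by netG stays intact, so the non-detached `fake` could still be reused later (e.g. for the generator update). Is that right?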