I’m using reinforcement learning to train two networks: an actor network, denoted A, and a critic network, denoted Q.
The actor network A receives an input state and produces an action tensor: A(s) -> a.
The critic network Q receives the same state, plus the action a that A produced in the step above, and produces a score: Q(s, a) -> q.
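To make the setup concrete, here is a minimal sketch of the two networks. The names, layer sizes, and dimensions are my own placeholders, not from my actual code:

```python
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM = 8, 2  # assumed dimensions, for illustration only

class Actor(nn.Module):
    """A(s) -> a: maps a state to an action tensor."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM, 64), nn.ReLU(),
            nn.Linear(64, ACTION_DIM), nn.Tanh(),  # bounded action
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a) -> q: maps a (state, action) pair to a scalar score."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(STATE_DIM + ACTION_DIM, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))

s = torch.randn(4, STATE_DIM)  # batch of 4 states
a = Actor()(s)                 # A(s) -> a, shape (4, ACTION_DIM)
q = Critic()(s, a)             # Q(s, a) -> q, shape (4, 1)
```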
And now I’m doing backpropagation on these two networks. To optimize A, the logic is simple:
```python
ce_optimizer = torch.optim.Adam(tn.critic_e.parameters(), lr=tn.LR_C)
ae_optimizer = torch.optim.Adam(tn.actor_e.parameters(), lr=tn.LR_A)

...

# optimize the critic network, which is easy (omitted)

# optimize the actor network
ae_optimizer.zero_grad()
ce_optimizer.zero_grad()  # since we optimized the critic network in the part above
a_pred = tn.actor_e(patch, hmap)
q_eval = tn.critic_e(patch, hmap, a_pred)
q_eval.backward(torch.tensor([[1.0]]).to(device))
ae_optimizer.step()  # note: using the optimizer for the actor network
```
Is the logic here correct? That is: do one forward pass through A and then Q, backpropagate through both, and once A has its gradients, simply call the corresponding optimizer’s step().
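For reference, my understanding of the textbook DDPG-style actor update is to minimize -Q(s, A(s)) (i.e., gradient ascent on q), stepping only the actor’s optimizer. A self-contained sketch, using made-up placeholder networks rather than my real ones:

```python
import torch
import torch.nn as nn

# placeholder networks for illustration only
actor = nn.Linear(8, 2)         # stands in for A(s) -> a
critic = nn.Linear(8 + 2, 1)    # stands in for Q(s, a) -> q
a_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(16, 8)          # batch of states

a_opt.zero_grad()
a_pred = actor(s)                                 # forward through A
q_eval = critic(torch.cat([s, a_pred], dim=-1))   # forward through Q
actor_loss = -q_eval.mean()     # descending on -q == ascending on q
actor_loss.backward()           # grads flow through Q's params into A's
a_opt.step()                    # update ONLY the actor's parameters
```

Note the minus sign: since `step()` performs gradient descent, backpropagating q directly (as with `backward(torch.tensor([[1.0]]))` above) would push the actor toward a *lower* score, so the sign depends on whether q is a reward-like score to maximize. The critic’s parameters also accumulate gradients here; they are simply zeroed before the critic’s own update.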