I’m using reinforcement learning to train two networks: an actor network, denoted A, and a critic network, denoted Q.

The actor network A receives an input state and produces an action tensor: A(s) -> a.

The critic network Q receives the same input, plus the action tensor a that A produced in the step above, and produces a scalar score:

Q(s, a) -> q
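For concreteness, here is a minimal sketch of the two networks. The layer sizes and activations are just placeholders, not my real architecture (my actual inputs are a patch and a heatmap, which I've flattened away here):

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """A(s) -> a: maps a state to an action tensor."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 64), nn.ReLU(),
            nn.Linear(64, action_dim), nn.Tanh(),  # bound actions to [-1, 1]
        )

    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    """Q(s, a) -> q: scores a state-action pair with one scalar per sample."""
    def __init__(self, state_dim=8, action_dim=2):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 64), nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, s, a):
        return self.net(torch.cat([s, a], dim=-1))
```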

Now I’m doing backpropagation on these two networks. To optimize A, my code is straightforward:

```
ce_optimizer = torch.optim.Adam(tn.critic_e.parameters(), lr=tn.LR_C)
ae_optimizer = torch.optim.Adam(tn.actor_e.parameters(), lr=tn.LR_A)
... # optimize the critic network (omitted; that part is straightforward)
# optimize the actor network
ae_optimizer.zero_grad()
ce_optimizer.zero_grad() # clear grads left on the critic by its update above
a_pred = tn.actor_e(patch, hmap)
q_eval = tn.critic_e(patch, hmap, a_pred)
q_eval.backward(torch.tensor([[1.0]]).to(device))
ae_optimizer.step() # step only the actor's optimizer
```

Is the logic here correct? That is: do one forward pass through A and then Q, backpropagate through both, and once A's parameters have their grads, step only the actor's optimizer.
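For comparison, the textbook DDPG actor update, as I understand it, takes the negated mean of Q as the loss, so that the optimizer's descent step ascends Q. A toy, self-contained sketch of that flow (the linear layers and sizes are stand-ins, not my real networks):

```python
import torch

# toy stand-ins for the actor and critic (shapes are placeholders)
actor = torch.nn.Linear(4, 2)    # A(s) -> a
critic = torch.nn.Linear(6, 1)   # Q(s, a) -> q, taking concat of state (4) and action (2)

a_opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(16, 4)           # a batch of states
a_opt.zero_grad()
a = actor(s)                               # forward through A
q = critic(torch.cat([s, a], dim=-1))      # forward through Q, using A's output
actor_loss = -q.mean()           # minimizing -Q means maximizing Q
actor_loss.backward()            # grads flow through the critic into the actor
a_opt.step()                     # update only the actor's parameters
```

Note that this backward pass also leaves gradients on the critic's parameters, so the critic's own optimizer has to call zero_grad before its next update, mirroring the zero_grad calls in my snippet.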