I have two identical networks, A and B, where A is one step of backpropagation behind B (so B is one step ahead). My loop is:
1. Generate samples from both A and B; each network's forward pass returns a grad-tracking tensor "result": result-A and result-B.
2. Copy B's weights to A.
3. Apply the gradients from both results to B, so that B is one step ahead again.
For step 3 I concatenated result-A and result-B, but I realized it wouldn't work that way.
How can I apply the gradients from both result-A and result-B to B?
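One way to sketch step 3 (this is my reading of the setup, not code from the question): since A's weights were just copied from B, the gradients computed through A are gradients with respect to the same weight values, so they can be summed into B's `.grad` buffers before a single optimizer step. The architecture, losses, and names below are all placeholders.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-ins for A and B: identical architectures (assumed, as in the question).
def make_net():
    return nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 2))

net_A, net_B = make_net(), make_net()
net_A.load_state_dict(net_B.state_dict())  # A starts as an exact copy of B

opt_B = torch.optim.SGD(net_B.parameters(), lr=0.1)

x_A, x_B = torch.randn(3, 4), torch.randn(3, 4)

# Hypothetical losses built from each network's forward output ("result").
loss_A = net_A(x_A).pow(2).mean()
loss_B = net_B(x_B).pow(2).mean()

loss_A.backward()  # gradients land in net_A's parameters' .grad
loss_B.backward()  # gradients land in net_B's parameters' .grad

# Fold A's gradients into B's; valid only because the weights were identical.
with torch.no_grad():
    for p_A, p_B in zip(net_A.parameters(), net_B.parameters()):
        p_B.grad.add_(p_A.grad)

opt_B.step()       # B advances one step using both gradient contributions
opt_B.zero_grad()
net_A.zero_grad()
```

Summing per-parameter `.grad` tensors avoids the concatenation problem entirely: the two results never need to live in one tensor, only their gradients need to land on one set of weights.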
In case anyone wonders why I am trying to do the above:
I am implementing distributed learning for reinforcement learning, where multiple 'player' agents only produce outputs from inputs, and a 'trainer' agent applies the updates.
They all share the same network architecture but have different weights; each player carries a copy of the trainer's weights that is at most as recent as the trainer's current ones.
If I use a replay buffer as in an off-policy setup, it is easy and mostly just coding: replace the reward at the dimension representing the action, compute the loss, and call backward.
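The off-policy case I mean looks roughly like this DQN-style sketch (all names and shapes here are illustrative, not from my actual code):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical Q-network and one replayed transition.
q_net = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 2))
opt = torch.optim.SGD(q_net.parameters(), lr=0.01)

state = torch.randn(1, 4)
action = torch.tensor([1])    # action taken when the sample was generated
reward = torch.tensor([0.5])  # target value observed for that action

q_values = q_net(state)                 # current estimates for all actions
target = q_values.detach().clone()
target[0, action] = reward              # replace only the taken action's entry

loss = nn.functional.mse_loss(q_values, target)  # find loss
loss.backward()                                  # do backward
opt.step()
```

Because only the taken action's target entry differs from the detached prediction, the loss (and hence the gradient) depends only on stored transition data, never on which weights produced the action.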
But if it is on-policy, using Categorical log_prob, it becomes hard: the action candidates and the actions themselves were produced by the previous weights, not the current ones. I don't think the trainer would update correctly if I simply swapped in the values …
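To make the mismatch concrete, here is a minimal sketch (my own illustration, with made-up networks) of how the same action gets a different log-prob under the stale player weights versus the trainer's current weights; the importance-sampling ratio at the end is one standard remedy from PPO-style methods, not something from the question itself:

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical

torch.manual_seed(0)

policy_old = nn.Linear(4, 3)  # stale policy head on the player side
policy_new = nn.Linear(4, 3)  # trainer's current policy head

state = torch.randn(1, 4)

# The action was sampled under the OLD weights on the player.
dist_old = Categorical(logits=policy_old(state))
action = dist_old.sample()
logp_old = dist_old.log_prob(action)

# Under the trainer's CURRENT weights, the same action has a different log-prob,
# so a vanilla on-policy loss built from logp_old would be off-distribution.
dist_new = Categorical(logits=policy_new(state))
logp_new = dist_new.log_prob(action)

# Importance-sampling ratio pi_new(a|s) / pi_old(a|s), as in PPO-style objectives.
ratio = (logp_new - logp_old.detach()).exp()
```

Reweighting (or clipping) by this ratio is how off-policyness from slightly stale player weights is usually corrected in distributed on-policy setups.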