Hello,

I’m working on a PPO implementation. PPO is an RL algorithm in which we constrain the policy updates to stay in a neighborhood of the previous policy.

To do so, the algorithm relies on a probability ratio that compares the probability of the sampled actions under the new policy to their probability under the old policy.

So, in my implementation I:

- Compute the log-probabilities of the sampled actions under the old policy
- Compute the log-probabilities of the same actions under the new policy
- Evaluate the ratio
- Clone the new policy into the old one
- Update the new policy
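The ratio computation in the steps above can be sketched like this (a minimal, framework-free sketch; the probabilities and the `log_prob_*` names are just placeholders, not my actual code):

```python
import math

# Hypothetical log-probabilities of the same sampled action under each policy.
log_prob_old = math.log(0.25)  # pi_old(a|s)
log_prob_new = math.log(0.30)  # pi_new(a|s)

# PPO importance ratio pi_new(a|s) / pi_old(a|s), computed in log space.
ratio = math.exp(log_prob_new - log_prob_old)
print(ratio)  # ~1.2; two identical policies would give exactly 1.0
```

If the two policies are genuinely different, this ratio should drift away from 1 as training progresses.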

I noticed that my ratio is always exactly equal to one. Could it be because I copy the network using this method:

```python
for p_source, p_target in zip(self.parameters(), clone.parameters()):
p_target.data = p_source.data
```
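If it helps, here is the behaviour I suspect, illustrated with plain Python lists standing in for the parameter tensors (rebinding shares the object, whereas an explicit copy keeps the two independent):

```python
# Rebinding: "target" now refers to the very same list object as "source",
# so in-place updates to source are visible through target.
source = [1.0, 2.0]
target = [0.0, 0.0]
target = source
source[0] = 5.0
print(target[0])  # 5.0 -- the "clone" tracks the source

# An actual copy keeps them independent.
target2 = list(source)
source[0] = 9.0
print(target2[0])  # still 5.0
```

So my worry is that the assignment above makes the old policy share its parameters with the new one, which would explain a ratio that is always one.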

Thanks!