This line centers the rewards. Is there a specific reason for this? The original algorithm does not mention centering.
This is a classic trick used in many papers: normalizing the rewards often speeds up learning considerably.
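For readers unfamiliar with the trick, here is a minimal sketch of what reward normalization usually looks like. It assumes the rewards of one batch or episode are available as a plain list of floats. The function name `normalize_rewards` and the `eps` parameter are hypothetical and not taken from the code under discussion, which may only subtract the mean.

```python
from statistics import fmean, pstdev


def normalize_rewards(rewards, eps=1e-8):
    """Shift rewards to zero mean and scale them to roughly unit standard deviation.

    Hypothetical helper for illustration; not the code from the thread.
    """
    mean = fmean(rewards)
    std = pstdev(rewards, mu=mean)
    # eps avoids division by zero when every reward in the batch is identical.
    return [(r - mean) / (std + eps) for r in rewards]


# Example: rewards collected over one batch.
normalized = normalize_rewards([1.0, 2.0, 3.0, 4.0])
```

Subtracting the mean is the centering step; dividing by the standard deviation additionally keeps the scale of the learning signal comparable across batches.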