What is the justification for normalizing each episode's reward targets in the policy gradient examples?

Hi all,

I’m confused about this line in the actor_critic.py example; it also appears in Andrej Karpathy’s Pong w/Pixels code. Is there any good justification for normalizing the reward targets on a per-episode basis? My understanding is that reward normalization should be done over batches, or over all episodic rewards encountered so far during training, in order to keep the RL task stationary. I haven’t been able to find anything about this per-episode normalization in the literature.
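For context, the step in question looks roughly like this (a paraphrase of the actor_critic.py example in plain Python, not the exact code from the repo; function names here are my own):

```python
import statistics

def discounted_returns(rewards, gamma=0.99):
    """Compute discounted returns G_t for a single episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def normalize_per_episode(returns, eps=1e-8):
    """The line being discussed: z-score the returns of ONE episode,
    i.e. subtract that episode's mean and divide by its std."""
    mean = statistics.fmean(returns)
    std = statistics.pstdev(returns)
    return [(g - mean) / (std + eps) for g in returns]

advantages = normalize_per_episode(discounted_returns([1.0, 0.0, -1.0, 2.0]))
```

Note that the mean and std here come from a single episode's returns, not from a running statistic across episodes, which is exactly what the question is about.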

I have the same question.

Intuitively, if we normalize over individual episodes, then episodes with a high overall reward get squashed into the same range as episodes with very poor rewards.

For example, suppose in one episode we take actions that yield a near-constant -50 reward at every time step, and in another we receive +50. After we z-score each episode, the actions in both episodes are reinforced to the same degree. This still reinforces correctly on a per-action basis within each episode, but it effectively reduces all episodes to having the same amount of positive and negative reinforcement. I can only see this kind of normalization working in a true MDP environment.
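This intuition is easy to check numerically. The sketch below (my own illustration, assuming the usual discounted-return + per-episode z-score recipe) builds the two episodes described above and shows that their normalized advantage vectors are exact negations of each other, so both episodes carry the same mix of positive and negative reinforcement:

```python
import statistics

def discounted_returns(rewards, gamma=0.99):
    """Discounted returns G_t for one episode."""
    returns, g = [], 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.append(g)
    returns.reverse()
    return returns

def zscore(xs, eps=1e-8):
    """Per-episode normalization: z-score over this episode's returns."""
    m, s = statistics.fmean(xs), statistics.pstdev(xs)
    return [(x - m) / (s + eps) for x in xs]

# One uniformly "good" episode and one uniformly "bad" episode.
good = zscore(discounted_returns([+50.0] * 10))
bad = zscore(discounted_returns([-50.0] * 10))

# Element-wise, the bad episode's advantages are the negation of the good
# episode's: the overall +50 vs -50 difference is normalized away entirely.
print(all(abs(g + b) < 1e-6 for g, b in zip(good, bad)))  # True
```

So within each episode, relatively better time steps still get relatively larger advantages, but the information that one episode was globally better than the other is discarded, which is the crux of the question.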