What is the stopping condition for Reinforcement Learning (RL) algorithms in practice?

Take as an example the typical way to run an RL agent trained with policy gradients:

def main():
    running_reward = 10
    for i_episode in count(1):  # from itertools import count
        state = env.reset()
        for t in range(10000):  # Don't infinite loop while learning
            action = select_action(state)
            state, reward, done, _ = env.step(action)
            if args.render:
                env.render()
            if done:
                break

        running_reward = running_reward * 0.99 + t * 0.01
        if i_episode % args.log_interval == 0:
            print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(
                i_episode, t, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break

Can someone explain to me why that is the stopping condition for this RL environment?

In particular, how is running_reward > env.spec.reward_threshold chosen, and why is the update running_reward = running_reward * 0.99 + t * 0.01 rather than running_reward = running_reward * 0.99 + reward * 0.01? Not sure if there are any other small details worth asking about besides that…
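To make the update concrete: it is an exponential moving average (EMA) of episode lengths, and if this is the CartPole example, the episode length t and the episode return coincide since the reward is +1 per surviving step. A toy sketch of my own (not from the snippet above; the episode lengths are made up):

```python
# EMA smooths a noisy per-episode signal instead of jumping with it.
def ema_update(running, new_value, alpha=0.01):
    """Same form as: running_reward = running_reward * 0.99 + t * 0.01"""
    return running * (1 - alpha) + new_value * alpha

running = 10.0  # same arbitrary starting value as in the snippet
for t in [20, 25, 200, 30, 500]:  # made-up episode lengths
    running = ema_update(running, t)
print(running)  # drifts slowly toward recent lengths, still close to 10
```

With alpha = 0.01, a single long episode barely moves the average, so the threshold check fires only after the agent does well consistently, not after one lucky episode.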

Elsewhere I’ve seen this as a stopping condition:

        # Calculate score to determine when the environment has been solved
        ## TODO what the heck is this? why?
        scores_last_100 = scores[-100:]  # scores of the last 100 episodes
        mean_score = np.mean(scores_last_100)
        if episode % 50 == 0:
            print(f'Episode {episode}\tAverage length (last 100 episodes): {mean_score}')
        if mean_score > env.spec.reward_threshold:
            print(f'Solved after {episode} episodes! Running average is now {mean_score}. Last episode ran to {time} time steps.')

Can someone also explain that one to me?
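For what it's worth, my reading of that second criterion as a self-contained sketch (function name and window size are mine; the threshold value 195.0 is what I believe CartPole-v0's reward_threshold is in the Gym registry, so treat it as an assumption):

```python
import numpy as np

def solved(scores, threshold, window=100):
    """Return True once the mean of the last `window` episode scores
    exceeds `threshold` -- the trailing-average 'solved' convention."""
    if len(scores) < window:
        return False  # require a full window before declaring success
    return np.mean(scores[-window:]) > threshold

print(solved([200.0] * 100, 195.0))   # a full window of good episodes
print(solved([200.0] * 50, 195.0))    # not enough episodes yet
```

Compared to the EMA in the first snippet, this is a plain sliding-window average: every one of the last 100 episodes counts equally, and anything older is ignored entirely.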

git issue: