What is the stopping condition for Reinforcement Learning (RL) algorithms in practice?

Take as an example the typical way to run an RL agent trained with policy gradients:

def main():
    running_reward = 10
    for i_episode in count(1):  # from itertools import count
        state = env.reset()
        for t in range(10000):  # Don't infinite loop while learning
            action = select_action(state)
            state, reward, done, _ = env.step(action)
            if args.render:
                env.render()
            if done:
                break

        running_reward = running_reward * 0.99 + t * 0.01
        if i_episode % args.log_interval == 0:
            print('Episode {}\tLast length: {:5d}\tAverage length: {:.2f}'.format(
                i_episode, t, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break

Can someone explain to me why that is the stopping condition for this RL environment?

In particular, how is running_reward > env.spec.reward_threshold chosen, and why is the update running_reward = running_reward * 0.99 + t * 0.01 rather than running_reward = running_reward * 0.99 + reward * 0.01? Not sure if there are any other small details worth asking about besides that…
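To make the update concrete: it is an exponential moving average (EMA) of episode lengths, and if this is the CartPole example, the episode length t and the episode return coincide since the reward is +1 per surviving step. A toy sketch of my own (not from the snippet above; the episode lengths are made up):

```python
# EMA smooths a noisy per-episode signal instead of jumping with it.
def ema_update(running, new_value, alpha=0.01):
    """Same form as: running_reward = running_reward * 0.99 + t * 0.01"""
    return running * (1 - alpha) + new_value * alpha

running = 10.0  # same arbitrary starting value as in the snippet
for t in [20, 25, 200, 30, 500]:  # made-up episode lengths
    running = ema_update(running, t)
print(running)  # drifts slowly toward recent lengths, still close to 10
```

With alpha = 0.01, a single long episode barely moves the average, so the threshold check fires only after the agent does well consistently, not after one lucky episode.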

Elsewhere I’ve seen this as a stopping condition:

        # Calculate score to determine when the environment has been solved
        ## TODO what the heck is this? why?
        scores_last_100 = scores[-100:]  # scores of the last 100 episodes
        mean_score = np.mean(scores_last_100)
        if episode % 50 == 0:
            print(f'Episode {episode}\tAverage length (last 100 episodes): {mean_score}')
        if mean_score > env.spec.reward_threshold:
            print(f'Solved after {episode} episodes! Running average is now {mean_score}. Last episode ran to {time} time steps.')

Can someone also explain that one to me?
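For what it's worth, my reading of that second criterion as a self-contained sketch (function name and window size are mine; the threshold value 195.0 is what I believe CartPole-v0's reward_threshold is in the Gym registry, so treat it as an assumption):

```python
import numpy as np

def solved(scores, threshold, window=100):
    """Return True once the mean of the last `window` episode scores
    exceeds `threshold` -- the trailing-average 'solved' convention."""
    if len(scores) < window:
        return False  # require a full window before declaring success
    return np.mean(scores[-window:]) > threshold

print(solved([200.0] * 100, 195.0))   # a full window of good episodes
print(solved([200.0] * 50, 195.0))    # not enough episodes yet
```

Compared to the EMA in the first snippet, this is a plain sliding-window average: every one of the last 100 episodes counts equally, and anything older is ignored entirely.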

git issue: