My question comes from reading the code in the PyTorch DQN tutorial, but it applies to reinforcement learning in general: what are the best practices for balancing exploration and exploitation?

In the DQN tutorial, steps_done is a global variable and EPS_DECAY = 200, so the epsilon decay continues across episodes instead of restarting with each one. This means that:

after 128 steps, the epsilon threshold ≈ 0.50

after 889 steps, the epsilon threshold = 0.0600

after 1500 steps, the epsilon threshold = 0.05047
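
For reference, here is a minimal sketch of how I understand the threshold to be computed. EPS_START = 0.9 and EPS_END = 0.05 are my assumptions about the tutorial's other constants (they are not stated above, although they do reproduce the numbers listed), and epsilon_threshold is just a name I picked for illustration:

```python
import math

# Assumed constants: only EPS_DECAY = 200 is stated in the question.
# EPS_START and EPS_END are assumptions that match the listed values.
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200


def epsilon_threshold(steps_done: int) -> float:
    """Exponentially decay epsilon from EPS_START towards EPS_END.

    steps_done counts every action taken so far, across all episodes.
    """
    return EPS_END + (EPS_START - EPS_END) * math.exp(-steps_done / EPS_DECAY)


if __name__ == "__main__":
    for steps in (128, 889, 1500):
        print(f"after {steps} steps: epsilon = {epsilon_threshold(steps):.5f}")
```

With probability epsilon the agent takes a random action, and otherwise the greedy one, so under these assumptions almost all random exploration is over within the first couple of thousand steps.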

This might work for the CartPole problem featured in the tutorial – where the early episodes might be very short and the task is fairly simple – but what about more complex problems that need far more exploration? For example, if we had a problem with 40,000 episodes, each with 10,000 timesteps, how would we set up the epsilon-greedy exploration policy? Is there a rule of thumb that’s used in RL work?
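
To make the concern concrete, here is a rough calculation of how little of such a run the tutorial-style schedule would spend exploring. It assumes the same exponential decay with EPS_START = 0.9 and EPS_END = 0.05 (my assumptions, not stated above), and steps_until_within is a hypothetical helper name:

```python
import math

# Assumed constants: only EPS_DECAY = 200 comes from the question.
EPS_START = 0.9
EPS_END = 0.05
EPS_DECAY = 200

# The hypothetical larger problem from the question.
total_steps = 40_000 * 10_000


def steps_until_within(margin: float) -> float:
    """Steps needed for epsilon to come within `margin` of EPS_END.

    Solves EPS_END + (EPS_START - EPS_END) * exp(-t / EPS_DECAY)
           = EPS_END + margin   for t.
    """
    return -EPS_DECAY * math.log(margin / (EPS_START - EPS_END))


steps_needed = steps_until_within(0.001)
fraction = steps_needed / total_steps

if __name__ == "__main__":
    print(f"total steps: {total_steps:,}")
    print(f"steps until epsilon is within 0.001 of its floor: {steps_needed:.0f}")
    print(f"fraction of the run: {fraction:.2e}")
```

Under those assumptions the schedule reaches its floor before the first 10,000-step episode has even finished, which is why I suspect EPS_DECAY (or the whole schedule) has to be chosen differently for a problem of that size.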

Thank you in advance for any help.