I looked at the code of the DQN solution for CartPole:
And there are some points I don’t understand:
It seems that optimize_model is called at every step of every episode,
i.e. at each step the model is fitted using the next values as the ground truth (GT).
Why is this right?
Let’s say that in the first X steps the random actions happen to be the right moves, but we still end up in a state where the pole is about to fall or the cart is almost 2.4 units away from the center. We have then trained the model on X rewards (one per step), yet that training could steer us toward the wrong solution. Am I right?
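To make the question concrete, here is a minimal sketch of how I understand the per-step update. It is not the tutorial's code: `td_target`, `remember`, `sample_batch`, the value of `GAMMA`, and the buffer size are names and numbers I made up for illustration. As far as I can tell, the "GT" is not a true label but a bootstrapped estimate built from the reward and the model's own prediction for the next state, and each step trains on a random batch of stored transitions rather than only on the latest one.

```python
import random
from collections import deque

GAMMA = 0.99  # discount factor (assumed value, for illustration only)

def td_target(reward, next_q_values, done, gamma=GAMMA):
    """Bootstrapped target: r if the episode ended, else r + gamma * max_a' Q(s', a')."""
    if done:
        return reward
    return reward + gamma * max(next_q_values)

# Replay buffer: every environment step appends one transition.
memory = deque(maxlen=10_000)  # hypothetical capacity

def remember(state, action, reward, next_state, done):
    """Store one transition so later updates can reuse it."""
    memory.append((state, action, reward, next_state, done))

def sample_batch(batch_size, rng=random):
    """Return a random minibatch, or an empty list while the buffer is too small."""
    if len(memory) < batch_size:
        return []
    return rng.sample(list(memory), batch_size)
```

With this reading, a single step's update mixes transitions from many earlier steps and episodes, which is part of what I am asking about.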
I saw other implementations of DQN (CartPole) that use one model (and not two, as in the example above): https://towardsdatascience.com/deep-q-networks-theory-and-implementation-37543f60dd67
What is the benefit of using two models rather than one?
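Here is a small sketch of the difference as I understand it, using toy Q-tables in place of real networks. Everything in it (`target_value`, `policy_q`, `target_q`, the state names, the numbers) is hypothetical and only meant to show where the target comes from in each setup.

```python
import copy

GAMMA = 0.99  # discount factor (assumed value)

def target_value(reward, next_state, done, q_for_targets, gamma=GAMMA):
    """Compute r + gamma * max_a' Q(s', a') using whichever Q-function supplies targets."""
    if done:
        return reward
    return reward + gamma * max(q_for_targets[next_state])

# Toy stand-in for a network: state -> [Q(s, left), Q(s, right)]
policy_q = {"s0": [0.0, 0.0], "s1": [1.0, 2.0]}
target_q = copy.deepcopy(policy_q)  # frozen copy, synced only occasionally

# Pretend one training step changed the policy model's estimate for s1.
policy_q["s1"][1] = 5.0

# One model: the target is computed from the model that was just updated.
one_model_target = target_value(1.0, "s1", False, policy_q)
# Two models: the target comes from the frozen copy and stays put until the next sync.
two_model_target = target_value(1.0, "s1", False, target_q)

print(one_model_target, two_model_target)
```

In the one-model case the target shifts every time the model is updated, while in the two-model case it stays fixed between syncs. My question is why that fixed target matters in practice.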
We reach the done state in every episode.
So in some episodes we reach done after a small number of steps (the pole fell too early), and in other episodes after a larger number of steps and with a higher total reward (we kept the pole from falling for longer).
In both cases, the model is fitted (and the weights are updated).
Why do we fit the model in both cases, and not only in episodes whose total reward is higher than that of the previous (or best) episode?
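To show what I mean by the alternative, here is a tiny sketch comparing the two rules. The function `episodes_used` and the list of returns are invented for illustration; they are not taken from any real run.

```python
def episodes_used(returns, only_if_improved):
    """Return the indices of episodes that would trigger training under each rule."""
    used, best = [], float("-inf")
    for i, ret in enumerate(returns):
        # Current behaviour: train on every episode.
        # Proposed rule: train only when the return beats the best seen so far.
        if not only_if_improved or ret > best:
            used.append(i)
        best = max(best, ret)
    return used

returns = [12, 30, 9, 30, 45]  # hypothetical total rewards per episode
print(episodes_used(returns, only_if_improved=False))
print(episodes_used(returns, only_if_improved=True))
```

Under the proposed rule, the short episode (where the pole fell early) and the episode that only ties the best one would be skipped entirely.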