Admittedly, it’s way harder to train this way than training your model on the observations, which in the particular case of CartPole-v0 describe the underlying physical state of the game (cart position and velocity, pole angle and angular velocity). However, it is also way cooler, as it demonstrates the power of this algorithm.
However, I’d like to mention that the original Deep Q Learning algorithm described by Mnih et al. didn’t use the difference between the previous and the current frame as its state representation, but rather a stack of the 4 most recently seen and processed frames (resulting in a 1x4x84x84 input tensor, as they were training on gray-scale images). They also leveraged a technique called frame skipping:
The agent selects an action only on every k-th frame (k = 4 in the paper), and that action is simply repeated on the k-1 skipped frames. This lets the agent play through roughly k times more frames in the same amount of time, since computing an action with the network requires significantly more compute than stepping the environment.
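Both tricks fit naturally into a single gym-style wrapper. Here’s a minimal sketch (the class and parameter names are mine, not from the paper), assuming the classic gym API where step returns (obs, reward, done, info) and observations are raw RGB frames; for CartPole you’d grab frames via env.render(mode='rgb_array') instead:

```python
from collections import deque

import cv2
import gym
import numpy as np


class SkipAndStackFrames(gym.Wrapper):
    """Repeat each action for `skip` env frames and stack the last `stack` processed frames."""

    def __init__(self, env, skip=4, stack=4):
        super().__init__(env)
        self.skip = skip
        self.frames = deque(maxlen=stack)

    def _process(self, frame):
        # Gray scale + resize to 84x84, as in Mnih et al.
        gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
        return cv2.resize(gray, (84, 84), interpolation=cv2.INTER_AREA)

    def step(self, action):
        # Apply the chosen action for `skip` consecutive environment steps,
        # accumulating the reward along the way.
        total_reward, done, info = 0.0, False, {}
        for _ in range(self.skip):
            obs, reward, done, info = self.env.step(action)
            total_reward += reward
            if done:
                break
        self.frames.append(self._process(obs))
        # Shape (stack, 84, 84); add a batch dimension to get the 1x4x84x84 tensor.
        return np.stack(self.frames), total_reward, done, info

    def reset(self):
        # Fill the stack with copies of the first frame.
        frame = self._process(self.env.reset())
        for _ in range(self.frames.maxlen):
            self.frames.append(frame)
        return np.stack(self.frames)
```

Note that one frame is appended per agent step, so a stack of 4 processed frames actually spans 16 raw frames of gameplay.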
Additionally, the paper mentioned above deployed a second ‘target network’, whose weights are only updated (by copying them from the online network) every 10,000 steps; this further stabilizes the algorithm.
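As a rough sketch of how that looks in practice (assuming PyTorch and an existing online network called q_net, both my own choices, not prescribed by the paper):

```python
import copy

import torch

# Placeholder online Q-network; in a real setup this would be the conv net
# that consumes the 1x4x84x84 frame stacks.
q_net = torch.nn.Sequential(torch.nn.Linear(4, 2))
target_net = copy.deepcopy(q_net)
target_net.eval()  # never trained directly; it only receives copied weights

TARGET_UPDATE_EVERY = 10_000  # steps between hard updates


def maybe_update_target(step):
    # Periodically copy the online weights into the frozen target network.
    if step % TARGET_UPDATE_EVERY == 0:
        target_net.load_state_dict(q_net.state_dict())

# During training, the TD target bootstraps from the frozen copy:
#   y = r + gamma * target_net(next_state).max(dim=1).values.detach()
```

Freezing the bootstrap target for thousands of steps keeps the regression target from chasing the constantly moving online network, which is where much of the stabilization comes from.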