I implemented an actor-critic algorithm, closely inspired by PyTorch's official example. It does great on CartPole, for instance reaching scores over 190 within a few hundred iterations. Delighted by this, I moved on to my own environment, in which a robot has to touch a point in space.
In the simplest case:
- One single and fixed target
- Dense, distance-based reward (reward = 1 / distance from effector to target)
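For reference, the reward is just the inverse distance. A minimal sketch of what I mean (the names here are illustrative, my actual code is in the repo below):

```python
import numpy as np

def reward(effector_pos, target_pos, eps=1e-6):
    # Dense reward: inverse Euclidean distance from effector to target.
    # eps avoids division by zero when the effector reaches the target exactly.
    dist = np.linalg.norm(np.asarray(effector_pos) - np.asarray(target_pos))
    return 1.0 / (dist + eps)
```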
The agent cannot even get above a 50% success rate. Worse, performance degrades over training, dropping from about 20% to 2-3%, which puzzles me. The environment is exactly the same one I used for REINFORCE with baseline, where I reach around 100% success.
Would anyone have an idea of where the flaw might be? I must admit I don't understand it. Given how well REINFORCE with baseline performs, I really thought this would be a piece of cake for actor-critic.
Among the various things that could make this implementation fail, my main suspect is that the critic does not use experience replay, so it only sees consecutive, highly correlated transitions, which introduces a lot of variance when bootstrapping.
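To make the bootstrapping point concrete, here is a sketch of the one-step TD target I have in mind (standard actor-critic, not copied from my repo). Unlike the Monte Carlo return in REINFORCE, the target contains the critic's own estimate V(s_{t+1}), so a noisy critic feeds its noise back into its own training signal:

```python
import torch

def critic_targets(rewards, next_values, dones, gamma=0.99):
    # One-step bootstrapped TD target: r_t + gamma * V(s_{t+1}).
    # Because V(s_{t+1}) is the critic's own (possibly noisy) estimate,
    # errors in the critic propagate into its own targets, unlike the
    # pure Monte Carlo returns used by REINFORCE.
    return rewards + gamma * next_values * (1.0 - dones)
```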
But if that were the case, how come it works so well on CartPole? Is it because CartPole is so simple that the actor can find its way without listening to the critic, until the critic finally gets it right?
I’m going to add experience replay and share my results.
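The buffer I plan to add would look something like this (a generic sketch, nothing repo-specific):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal replay buffer to decorrelate the critic's training samples."""

    def __init__(self, capacity=10000):
        # deque with maxlen drops the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

One caveat I'm aware of: replaying old transitions makes the critic updates slightly off-policy, so this may need care if it interacts badly with the on-policy actor update.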
PS: Link to my github with the files: https://github.com/Mehd6384/Robot-World---RL