The original max torque is +/- 2 and the max speed is +/- 8. According to some solutions, the pendulum needs to swing several times before it can balance upright. I suspect this is not solvable by vanilla policy gradient with an MLP that has one layer of 50 neurons. What would be good values for max torque and max speed so that the pendulum needs to swing only once or twice to balance upright?
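One rough way to explore this question is to simulate the pendulum directly and count how often it has to turn around before it first reaches upright. The sketch below is only an illustration under stated assumptions: the dynamics (g = 10, m = 1, l = 1, dt = 0.05, semi-implicit Euler) are written as I recall them from Pendulum-v0, the start state is hanging straight down at rest (the real environment starts at a random angle), and the bang-bang "push in the direction of motion" controller and the `count_swings` helper are hypothetical, not part of Gym.

```python
import math


def count_swings(max_torque, max_speed=8.0, dt=0.05, g=10.0, m=1.0, l=1.0,
                 max_steps=2000):
    """Count velocity reversals before the pendulum first reaches upright.

    Assumptions: start hanging straight down at rest; the controller is a
    hypothetical bang-bang energy pump (full torque in the direction of motion).
    Returns None if upright is not reached within max_steps.
    """
    phi = 0.0        # angle measured from the hanging position; upright is +/- pi
    omega = 0.0      # angular velocity
    direction = 1.0  # sign of the torque currently applied
    reversals = 0
    for _ in range(max_steps):
        if omega * direction < 0:  # the pendulum has turned around
            direction = -direction
            reversals += 1
        u = direction * max_torque
        # Semi-implicit Euler step: gravity pulls back toward phi = 0
        omega += (-3.0 * g / (2.0 * l) * math.sin(phi)
                  + 3.0 / (m * l ** 2) * u) * dt
        phi += omega * dt
        omega = max(-max_speed, min(max_speed, omega))  # speed limit
        if abs(phi) >= math.pi:  # passed through the upright position
            return reversals
    return None


if __name__ == "__main__":
    for torque in (2.0, 3.0, 4.0, 6.0, 8.0):
        print(f"max_torque={torque}: reversals={count_swings(torque)}")
```

Under these assumed dynamics, the torque term is 3u and the gravity term peaks at 15, so a limit above 5 overcomes gravity at every angle. A simple energy balance (3u·π ≥ 30) further suggests that anything above roughly 10/π ≈ 3.2 may already be enough to go from rest at the bottom to upright in a single motion. This only counts swings to reach upright; it says nothing about how easy the resulting task is for a given learning algorithm.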
Out of curiosity, were you able to learn Pendulum-v0 with policy gradient?
I changed the max/min torque to +8/-8 and am still unable to solve it with REINFORCE or REINFORCE with a baseline. Maybe I need to tune it more.
Indeed, REINFORCE is not that good at learning features through linear layers. Adding a value prediction speeds up the learning of relevant features in the hidden layer. That's why actor-critic is much more stable and still works with 30 neurons.
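To make the difference concrete, here is a minimal sketch of the per-step training signals involved. It assumes Monte Carlo returns rather than bootstrapped TD targets, and the function names (`discounted_returns`, `policy_weights`, `value_loss`) are illustrative, not taken from any library. Plain REINFORCE weights each log-probability gradient by the return; with a value prediction, the weight becomes return minus predicted value, and the value head gets its own regression loss that also trains the shared hidden layer.

```python
def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every step t."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def policy_weights(rewards, values=None, gamma=0.99):
    """Weights that multiply grad log pi(a_t | s_t) in the policy loss.

    values=None gives plain REINFORCE (weight = return);
    otherwise weight = return - predicted value (an advantage estimate).
    """
    returns = discounted_returns(rewards, gamma)
    if values is None:
        return returns
    return [g - v for g, v in zip(returns, values)]


def value_loss(rewards, values, gamma=0.99):
    """Mean squared error between predicted values and observed returns."""
    returns = discounted_returns(rewards, gamma)
    return sum((g - v) ** 2 for g, v in zip(returns, values)) / len(returns)


if __name__ == "__main__":
    rewards = [-1.0, -0.5, -0.1]   # made-up costs, for illustration only
    values = [-1.5, -0.6, -0.1]    # made-up value predictions
    print(policy_weights(rewards))          # REINFORCE weights
    print(policy_weights(rewards, values))  # advantage-style weights
    print(value_loss(rewards, values))      # extra signal for the hidden layer
```

When the value predictions track the returns well, the advantage-style weights stay close to zero, which is the usual argument for why the gradient estimate is less noisy than with raw returns.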