Simple direct policy search suffers from wrong gradients?

I am currently experimenting with a simple direct policy search on OpenAI Gym's Pendulum-v0, implemented in PyTorch.

The approach is:

  1. Initialize an MLP policy with a single hidden layer of 50 neurons, ReLU nonlinearity, and a single continuous action value as output.

  2. Use the true dynamics model, made differentiable by porting the Gym code to PyTorch.

  3. Repeat

        3.1. Roll out a trajectory for T=50 time steps with the policy and compute the accumulated cost J.
    
        3.2. Call J.backward() and update the policy (a rough sketch of this loop is given below).
    

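Roughly, the setup looks like this (a minimal sketch, not my exact code; the constants and the step equation are the Pendulum-v0 defaults as I read them from the Gym source, and `cost` is only a placeholder for the saturation cost described further down):

```python
import math
import torch
import torch.nn as nn

# Pendulum-v0 constants (defaults from the Gym source; adjust if you changed them)
g, m, l, dt = 10.0, 1.0, 1.0, 0.05
max_speed, max_torque = 8.0, 2.0

def step(th, thdot, u):
    """One Euler step of the pendulum dynamics, written in torch ops so gradients flow."""
    u = torch.clamp(u, -max_torque, max_torque)
    thdot_new = thdot + (-3.0 * g / (2.0 * l) * torch.sin(th + math.pi)
                         + 3.0 / (m * l ** 2) * u) * dt
    thdot_new = torch.clamp(thdot_new, -max_speed, max_speed)
    th_new = th + thdot_new * dt
    return th_new, thdot_new

# MLP policy: one hidden layer, 50 units, ReLU, one continuous action
policy = nn.Sequential(nn.Linear(3, 50), nn.ReLU(), nn.Linear(50, 1))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

def cost(th, thdot, u):
    # placeholder -- the actual 0-1 saturation cost is shown further below
    return th ** 2

T = 50
for iteration in range(1000):
    # random initial state, mirroring Gym's reset(): th ~ U(-pi, pi), thdot ~ U(-1, 1)
    th = (torch.rand(1) * 2 - 1) * math.pi
    thdot = torch.rand(1) * 2 - 1

    J = torch.zeros(1)
    for t in range(T):
        obs = torch.cat([torch.cos(th), torch.sin(th), thdot])
        u = policy(obs)
        J = J + cost(th, thdot, u)
        th, thdot = step(th, thdot, u)

    optimizer.zero_grad()
    J.backward()   # backprop through the full rollout
    optimizer.step()
```
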
At first, the policy learns something reasonable: it swings the pole and pulls it up. But after that nothing better is learned, even though the gradients are still non-zero.

I'm confused about where the problem might be. I've tried different hidden sizes, learning rates, and max torques for the pendulum.

By the way, the cost function is a 0-1 saturation cost, i.e. `1 - exp(-0.5*x^2/s^2)`, where s is a scaling factor.
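
In PyTorch that cost looks roughly like this (here `x` stands for whatever deviation is penalized, e.g. the angle from upright, and `s` is the scaling factor; it would replace the `cost` placeholder in the sketch above):

```python
import torch

def saturation_cost(x, s=1.0):
    # 0-1 saturation cost: close to 0 near the goal, saturating towards 1 far away
    return 1.0 - torch.exp(-0.5 * x ** 2 / s ** 2)
```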