Very simple environment with continuous action space fails to learn effectively with PPO

I will post a link to the minimal reproduction code, but the env is really simple.
(link: Minimal PPO training environment)

A flat 2D world [0,1]×[0,1] has an agent and a target destination it needs to get to.
Thus the observation space is continuous and has shape (4,) {x0,y0,x1,y1}.
The action has two components: a direction to step in and how large a step to take.
Neither component compounds with previous actions; each step stands on its own.
Both components are in the range [-1,1], so the action space is continuous and has shape (2,).
Location is clipped at the world boundaries to be within [0,1] as well.
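
As an illustration of the movement rule described above, here is a minimal sketch. The function name `step_position`, the `max_step` value, and the linear mappings from [-1,1] to an angle and a step length are assumptions for illustration; the post does not say how the env maps the two action values.

```python
import math

def step_position(x, y, action, max_step=0.05):
    """Apply one (direction, magnitude) action and clip to the unit square.

    Assumptions (not from the original post): direction in [-1, 1] maps
    linearly to an angle in [-pi, pi], and magnitude in [-1, 1] maps
    linearly to a step length in [0, max_step].
    """
    direction, magnitude = action
    angle = direction * math.pi
    length = (magnitude + 1.0) / 2.0 * max_step
    # Each step depends only on the current position and this action.
    new_x = min(1.0, max(0.0, x + length * math.cos(angle)))
    new_y = min(1.0, max(0.0, y + length * math.sin(angle)))
    return new_x, new_y
```

With this mapping, an action that points into a wall leaves the clipped coordinate unchanged, which matches the "doesn't get anywhere" behavior at the border.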

I’ve tried running the exact same PPO method to train the BipedalWalker-v3 gym env, and it performs extremely well and steadily improves the average reward until the environment is essentially solved.

However, switching over to my environment, it performs abysmally.
After 1 million time steps (much longer than BipedalWalker needs for good results), it reaches the destination less than half the time, and even then most of the successes look accidental.

The rest of the time it just wedges itself into a corner or keeps trying to push through the world border, even though its position is clipped and it gets nowhere.

I’d think my environment is dramatically simpler and easier to learn, so why does it not work at all?

I’ve verified that the agent can move in every direction and can always reach its target well within the step limit. The episode is truncated after 500 time steps with a -10 reward, reaching the target gives a +10 reward, and otherwise the reward is just -dist, where dist is the distance from the agent to the target.
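
For reference, the reward scheme just described could be sketched roughly like this. The name `original_reward` and the `reach_radius` threshold are hypothetical; the post does not say how close the agent must get to count as having reached the target.

```python
import math

def original_reward(agent_xy, target_xy, t, max_steps=500, reach_radius=0.02):
    """Return (reward, terminated, truncated) for one step.

    reach_radius is an assumed value for illustration only.
    """
    dist = math.hypot(agent_xy[0] - target_xy[0], agent_xy[1] - target_xy[1])
    if dist <= reach_radius:
        return 10.0, True, False    # reached the target
    if t >= max_steps:
        return -10.0, False, True   # ran out of time
    return -dist, False, False      # otherwise: negative distance
```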

Even a random sample (random walk) reaches the target relatively often.

What is wrong with my environment / approach?

In the case of the bipedal walker, its objective is to stand and take steps, irrespective of direction. So it doesn’t have any visual information about the surrounding environment as an input, or any concept of location.

But in your example, it seems you want the agent to reach a physical location.

What information about the surrounding environment is given to your model?

The observation is the x,y coordinates of both the agent and the target (four values); that’s everything it knows about its environment.
And the reward, as I said, is just the negative distance between the agent and the target, so I’d assume the agent would want to close the gap as quickly as possible.

        # Distance from target; reward is simply the negative distance.
        target_dist = float(norm(self._agent_location - self._destination))
        reward: float = -target_dist

I think this might be an incorrect formulation of the reward in your case. Ideally, you want to reward the model for decreasing the distance to the target. This means you should calculate the delta between the start distance and the end distance on each step.

Consider this: when the model is far from the target, the current function tells the model it needs to change its parameters a lot (likely resulting in exploding gradients), while when it is very close to the target, it changes the parameters very little. But how did its actions on that time step determine where it started? There is no correlation between the actions it took on that time step and the total distance it happens to be at on that time step.

But if you give the model the difference distance_start - distance_end for each step, you then have a reward correlated with the actions taken on that step.
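
To make the difference between the two reward signals concrete, here is a small worked example on a made-up path (the path, target, and helper names are illustrative, not taken from the original code). The per-step delta is positive when the step moved closer and negative when it moved away, and the deltas sum to the initial distance minus the final distance.

```python
import math

def distance(a, b):
    return math.hypot(a[0] - b[0], a[1] - b[1])

def rewards_along_path(path, target):
    """Compare the two reward formulations for each step of a path."""
    # Negative distance: depends only on where the agent ends up.
    neg_dist = [-distance(p, target) for p in path[1:]]
    # Delta: distance at the start of the step minus distance at its end.
    delta = [distance(prev, target) - distance(cur, target)
             for prev, cur in zip(path, path[1:])]
    return neg_dist, delta

# Hypothetical path: two steps toward the target, then one step back.
path = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (0.1, 0.0)]
target = (0.5, 0.0)
neg_dist, delta = rewards_along_path(path, target)
print(neg_dist)  # approximately [-0.4, -0.3, -0.4]
print(delta)     # approximately [0.1, 0.1, -0.1]
```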

First, I want to say thank you for your help :)

That reward was just the one I saw used in a similar, even slightly more sophisticated model (using DQN): PuckWorld.

I previously tried normalizing the distance by the initial distance, and I also tried what you suggested:

        prev_target_dist = float(norm(self._last_location - self._destination))
        target_dist = float(norm(self._agent_location - self._destination))
        reward: float = prev_target_dist - target_dist

and still it performs erratically even after 1,000,000 steps. I feel there must be something else flawed in my env.

I also tried controlling the agent with manual keyboard input and found nothing inconsistent about its movement based on the action values.

Thanks for trying that. I had a chance to run your code with the above change. I made a few modifications to your code overall:

  1. Added a way to track the average number of steps per episode;
  2. Reduced the reward for finding the target from 10 to 0.2;
  3. Increased the learning rate by a factor of 10.

Here is the result from tracking:

It appears to me to be improving.

The new metrics were added with the following lines of code:

step = 0
episode = 0

steps_memory = torch.tensor([]) # new line
steps_save = np.empty((0,2)) # new line
while step < TRAINING_LEN:
        if step % SAVE_FREQ == 0:
            ...  # existing save logic, omitted here

        # new code below
        if term | trunc:
            # Keep the previous 100 episode lengths plus the newest one.
            steps_memory = torch.cat([steps_memory[-100:], torch.tensor([float(t)])])
            avg_steps = torch.mean(steps_memory).item()
            step_save = np.array([[episode, avg_steps]])
            steps_save = np.concatenate([steps_save, step_save])
        # new code above
        if term: print(f'\nAchieved goal (Eps: {episode}; in {t} steps) (Reward: {reward} / {total_reward}) (Avg Steps: {avg_steps})')

# new code below
rows = ["{},{}".format(i, j) for i, j in steps_save]
text = "\n".join(rows)

with open('steps_save.csv', 'w') as f:
    f.write(text)

# new code above

wait: str = input('run sim? (y/n): ')

With that said, I think you could improve your approach with the following changes:

  1. Change the action space to 5 values: values 1 through 3 are the probabilities of rotating 10 degrees left, not rotating, or rotating 10 degrees right (take the max to choose the action), and the other two are the probabilities of moving forward a fixed amount along the orientation after the rotation has been applied, or not moving (i.e. move vs. don’t);
  2. The above assumes you change the rotation to be additive;
  3. You’ll also need to include the orientation in the observation state, since it would be additive;
  4. Update the reward function to add one more component: pi - abs(agent_orientation - target_angle_from_agent); multiply this by an alpha so that its scale is comparable to the distance reward.
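
The four suggestions above could be sketched as follows. This is only one possible reading: the ordering of the five outputs, the counterclockwise-positive angle convention, the `alpha` value, and the angle wrapping in `wrap_angle` are assumptions added for illustration (the formula in item 4 does not mention wrapping).

```python
import math

# Assumed output order: [rotate left, no rotation, rotate right, move, stay].
ROTATIONS_DEG = (10.0, 0.0, -10.0)  # counterclockwise-positive (assumption)

def decode_action(outputs):
    """Pick the rotation with the highest score and decide move vs. stay."""
    rot_idx = max(range(3), key=lambda i: outputs[i])
    move = outputs[3] > outputs[4]
    return math.radians(ROTATIONS_DEG[rot_idx]), move

def wrap_angle(a):
    """Wrap an angle into [-pi, pi) so the heading error never exceeds pi."""
    return (a + math.pi) % (2.0 * math.pi) - math.pi

def orientation_reward(agent_xy, orientation, target_xy, alpha=0.1):
    """alpha * (pi - |heading error|): largest when facing the target."""
    target_angle = math.atan2(target_xy[1] - agent_xy[1],
                              target_xy[0] - agent_xy[0])
    error = abs(wrap_angle(orientation - target_angle))
    return alpha * (math.pi - error)
```

Because the rotation is additive, the env would add the decoded rotation to the stored orientation each step and include that orientation in the observation.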

Alternatively, you could do the same as Stanford in your link and make the action space 5 values, where values 1 through 4 are the probabilities of left, up, right, or down, and the final value is a scale of 0 to 1 for the velocity. (Granted, they included velocity and acceleration in their function, likely as three values where 0 is decelerate by a scalar, 1 is no acceleration, and 2 is accelerate by a scalar.) When you frame the problem in terms of probabilities, it tends to be easier for networks to pick up on a good solution quickly.


Thank you very much for taking the time to run my code and modify it.

I suppose I need to be more careful about the rewards in the future.

May I also ask whether I can avoid discretizing the actions (e.g. “rotate 5 deg”) and instead “rotate x degrees” with x given by the model, while still framing it as a set of probabilities? Maybe by interpolating between the two most likely rotations according to their probabilities?
This is in case it isn’t convenient to divide them into discrete units.
I don’t believe BipedalWalker’s actions are discretized like that; could that be an improvement?
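
The interpolation being asked about could look something like the sketch below. Both helpers are hypothetical illustrations of the idea, not something from the original code, and nothing here says whether training on such an action would work well.

```python
def expected_rotation(probs, angles_deg):
    """Probability-weighted average over all discrete rotations."""
    total = sum(probs)
    return sum(p * a for p, a in zip(probs, angles_deg)) / total

def top2_rotation(probs, angles_deg):
    """Interpolate between only the two most likely rotations."""
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    weight = sum(probs[i] for i in top)
    return sum(probs[i] * angles_deg[i] for i in top) / weight

# Example: probabilities for rotating +10, 0, and -10 degrees.
probs = [0.6, 0.3, 0.1]
angles = [10.0, 0.0, -10.0]
print(expected_rotation(probs, angles))  # about 5.0 degrees
print(top2_rotation(probs, angles))      # about 6.67 degrees
```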

You can set it up however you like and the model will learn something, just as in your current setup.

Just speaking from experience, models tend to converge faster with discrete values, where the model is just deciding the probability for a given action choice.

In the real world, you can always reduce your timestep, with the only limitation being compute speed. But a smaller model computes faster, so that’s the trade-off. If you cut the discrete steps small enough, the behavior looks and acts continuous over larger time frames.
