Policy outputs the same thing for any state

So I trained a PPO model using torchrl. During training the model shows very little variation in its actions, but during validation it outputs the exact same action vector every time, with no variation whatsoever.

I just wanted to ask if this is a common problem in RL or if I should be looking for a bug in my code?
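To make the symptom concrete, here is a rough sketch of how one could measure it: feed a batch of different observations through the policy, collect the actions, and look at the spread of each action dimension across the batch. The function names and the toy data are made up for illustration; in practice `actions` would be the policy outputs converted to nested lists.

```python
from statistics import pstdev

def per_dimension_spread(actions):
    """Population std of each action dimension across a batch.

    `actions` is a sequence of equal-length action vectors,
    one vector per observation.
    """
    return [pstdev(column) for column in zip(*actions)]

def is_state_independent(actions, tol=1e-6):
    """True if no action dimension varies by more than `tol` across the batch."""
    return max(per_dimension_spread(actions)) < tol

# Toy data (hypothetical): three observations, 2-dimensional actions.
collapsed = [[0.5, 0.5], [0.5, 0.5], [0.5, 0.5]]
varied = [[0.1, 0.9], [0.5, 0.5], [0.8, 0.2]]

print(is_state_independent(collapsed))  # True
print(is_state_independent(varied))     # False
```

If the check returns True for clearly different observations, the policy output does not depend on its input at all, which points at the network or the data pipeline rather than at exploration noise.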

Here are some more details: my observation space has shape (309,) and my action space has shape (103,). Both are continuous, with actions limited to [0, 1]. I train with actions in [0, 1]; I heard it’s better to use [-1, 1], so maybe that’s the issue.
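A minimal sketch of what switching to [-1, 1] could look like, assuming the policy's raw output is squashed with tanh and the environment still expects actions in [0, 1]. The helper names are hypothetical and not part of torchrl; the arithmetic is written with plain operators so the same expressions should also work elementwise on tensors.

```python
import math

def to_unit_interval(a):
    """Map an action from [-1, 1] to [0, 1]."""
    return (a + 1.0) / 2.0

def to_symmetric(a):
    """Map an action from [0, 1] to [-1, 1] (inverse of to_unit_interval)."""
    return 2.0 * a - 1.0

# Assumed flow: unbounded network output -> tanh -> rescale for the env.
raw_output = 0.3
squashed = math.tanh(raw_output)         # lies in (-1, 1)
env_action = to_unit_interval(squashed)  # lies in (0, 1)
assert 0.0 < env_action < 1.0
```

With this layout the policy and the loss only ever see the symmetric range, and the rescaling happens once at the environment boundary.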

This is my PPO config:

name: ppo

  width: 512
  number_hidden: 3

  width: 512
  number_hidden: 3

clip_epsilon: 0.2
critic_coef: 1.0
entropy_eps: 1.0e-4
frames_per_batch: 1024
gamma: 0.99
lmbda: 0.95
learning_rate: 3.0e-3
loss_critic_type: smooth_l1
max_grad_norm: 1.0
num_epochs: 256
sub_batch_size: 512
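For context, a quick back-of-the-envelope on how many optimizer steps this config implies per collected batch. This assumes the training loop has the usual shape of an outer loop over `num_epochs` with an inner loop over `frames_per_batch // sub_batch_size` minibatches; the actual loop is not shown here, so treat the result as an estimate.

```python
# Values copied from the config above.
frames_per_batch = 1024
sub_batch_size = 512
num_epochs = 256

# Assumed loop: for each epoch, iterate over all minibatches of the batch.
minibatches_per_epoch = frames_per_batch // sub_batch_size
updates_per_batch = num_epochs * minibatches_per_epoch

print(minibatches_per_epoch, updates_per_batch)  # 2 512
```

Under that assumption, each batch of 1024 frames is reused for 512 gradient updates, which is worth keeping in mind alongside the 3.0e-3 learning rate.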

Hi @Viktor_Todosijevic
Can you show how you build your model?
I’d be glad to see what the policy looks like