So I trained a PPO model using torchrl, and what's happening is that during training the model shows very little variation in its actions, but during validation it outputs the exact same vector of actions every time, no variation whatsoever.
I just wanted to ask: is this a common behavior in RL, or should I be looking for a bug in my code?
Here are some more details: my observation space has shape (309,) and my action space has shape (103,), both continuous, with actions limited to [0, 1]. I also train them in [0, 1]; I've heard it's better to train in [-1, 1], so maybe that's the issue.
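If it helps, this is roughly how I understand the [-1, 1] version would look in torchrl: a tanh-squashed actor bounded to [-1, 1], plus an affine rescale back to [0, 1] before the env step. This is just a sketch with made-up names like `actor_net`, not my actual code:

```python
import torch
from torch import nn
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torchrl.modules import ProbabilisticActor, TanhNormal

action_dim = 103  # my action space is (103,)

# MLP that outputs loc/scale parameters for the action distribution.
actor_net = nn.Sequential(
    nn.LazyLinear(512), nn.Tanh(),
    nn.LazyLinear(512), nn.Tanh(),
    nn.LazyLinear(512), nn.Tanh(),
    nn.LazyLinear(2 * action_dim),
    NormalParamExtractor(),  # splits the last dim into loc and scale
)
policy_module = TensorDictModule(
    actor_net, in_keys=["observation"], out_keys=["loc", "scale"]
)

# Tanh-squashed Gaussian bounded to [-1, 1]; recent torchrl versions use
# low/high for these kwargs, older releases called them min/max.
policy = ProbabilisticActor(
    module=policy_module,
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    distribution_kwargs={"low": -1.0, "high": 1.0},
    return_log_prob=True,
)

def to_env_range(a: torch.Tensor) -> torch.Tensor:
    """Map a policy action in [-1, 1] back to the env's [0, 1] range."""
    return (a + 1.0) / 2.0
```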
This is my PPO config:
```yaml
name: ppo
actor:
  width: 512
  number_hidden: 3
critic:
  width: 512
  number_hidden: 3
clip_epsilon: 0.2
critic_coef: 1.0
entropy_eps: 1.0e-4
frames_per_batch: 1024
gamma: 0.99
lmbda: 0.95
learning_rate: 3.0e-3
loss_critic_type: smooth_l1
max_grad_norm: 1.0
num_epochs: 256
sub_batch_size: 512
```
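For context, my validation rollout follows the usual torchrl pattern, roughly like this (simplified sketch; `env` stands in for my custom environment, which I've omitted, and `policy` for my trained actor):

```python
import torch
from torchrl.envs.utils import ExplorationType, set_exploration_type

# Roll out at the distribution mean instead of sampling
# (ExplorationType.MODE / DETERMINISTIC in other torchrl versions).
# As I understand it, with a fixed start state this makes the action
# sequence identical across runs, which matches what I'm seeing.
with set_exploration_type(ExplorationType.MEAN), torch.no_grad():
    eval_rollout = env.rollout(max_steps=1000, policy=policy)
```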