So I trained a PPO model using torchrl, and what's happening is that during training the model shows very little variation in its actions, but during validation it outputs the exact same vector of actions every time, no variation whatsoever.
I just wanted to ask: is this a common behavior in RL, or should I be looking for a bug in my code?
Here are some more details: my observation space has shape (309,) and my action space has shape (103,), both continuous, with actions limited to [0, 1]. I also train them in [0, 1]; I've heard it's better to train in [-1, 1], so maybe that's the issue.
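If it helps, this is roughly how I understand the [-1, 1] version would look in torchrl: a tanh-squashed actor bounded to [-1, 1], plus an affine rescale back to [0, 1] before the env step. This is just a sketch with made-up names like `actor_net`, not my actual code:

```python
import torch
from torch import nn
from tensordict.nn import TensorDictModule
from tensordict.nn.distributions import NormalParamExtractor
from torchrl.modules import ProbabilisticActor, TanhNormal

action_dim = 103  # my action space is (103,)

# MLP that outputs loc/scale parameters for the action distribution.
actor_net = nn.Sequential(
    nn.LazyLinear(512), nn.Tanh(),
    nn.LazyLinear(512), nn.Tanh(),
    nn.LazyLinear(512), nn.Tanh(),
    nn.LazyLinear(2 * action_dim),
    NormalParamExtractor(),  # splits the last dim into loc and scale
)
policy_module = TensorDictModule(
    actor_net, in_keys=["observation"], out_keys=["loc", "scale"]
)

# Tanh-squashed Gaussian bounded to [-1, 1]; recent torchrl versions use
# low/high for these kwargs, older releases called them min/max.
policy = ProbabilisticActor(
    module=policy_module,
    in_keys=["loc", "scale"],
    distribution_class=TanhNormal,
    distribution_kwargs={"low": -1.0, "high": 1.0},
    return_log_prob=True,
)

def to_env_range(a: torch.Tensor) -> torch.Tensor:
    """Map a policy action in [-1, 1] back to the env's [0, 1] range."""
    return (a + 1.0) / 2.0
```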
This is my PPO config:
```yaml
name: ppo
actor:
  width: 512
  number_hidden: 3
critic:
  width: 512
  number_hidden: 3
clip_epsilon: 0.2
critic_coef: 1.0
entropy_eps: 1.0e-4
frames_per_batch: 1024
gamma: 0.99
lmbda: 0.95
learning_rate: 3.0e-3
loss_critic_type: smooth_l1
max_grad_norm: 1.0
num_epochs: 256
sub_batch_size: 512
```
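For context, my validation rollout follows the usual torchrl pattern, roughly like this (simplified sketch; `env` stands in for my custom environment, which I've omitted, and `policy` for my trained actor):

```python
import torch
from torchrl.envs.utils import ExplorationType, set_exploration_type

# Roll out at the distribution mean instead of sampling
# (ExplorationType.MODE / DETERMINISTIC in other torchrl versions).
# As I understand it, with a fixed start state this makes the action
# sequence identical across runs, which matches what I'm seeing.
with set_exploration_type(ExplorationType.MEAN), torch.no_grad():
    eval_rollout = env.rollout(max_steps=1000, policy=policy)
```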