When running env.rollout, the policy module is called internally with a single sample at each step.
That sample has the shape of the observation, without any batch_size dimension.
To keep things consistent, is there a way to have the rollout add a batch_size dimension of 1, instead of unsqueezing inside the model itself?
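
For illustration, one way to keep the unsqueeze out of the model is a thin wrapper around it. This is only a minimal sketch in plain PyTorch, assuming the model expects input of shape [batch, obs_dim]; `BatchDimWrapper` and `BatchedOnlyNet` are hypothetical names made up for this example, not TorchRL APIs.

```python
import torch
from torch import nn


class BatchDimWrapper(nn.Module):
    """Hypothetical helper: wraps a module that expects a leading batch
    dimension so that it can be called with an unbatched observation."""

    def __init__(self, module: nn.Module):
        super().__init__()
        self.module = module

    def forward(self, observation: torch.Tensor) -> torch.Tensor:
        # Add a batch dimension of size 1 in front of the observation.
        batched = observation.unsqueeze(0)
        out = self.module(batched)
        # Drop the batch dimension again so the output is unbatched,
        # matching the unbatched input.
        return out.squeeze(0)


class BatchedOnlyNet(nn.Module):
    """Stand-in for a model that only accepts [batch, obs_dim] input."""

    def __init__(self, obs_dim: int, act_dim: int):
        super().__init__()
        self.linear = nn.Linear(obs_dim, act_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        assert x.dim() == 2, "expects input of shape [batch, obs_dim]"
        return self.linear(x)


# The inner model never sees an unbatched tensor.
policy_net = BatchDimWrapper(BatchedOnlyNet(obs_dim=4, act_dim=2))
action = policy_net(torch.zeros(4))
print(action.shape)  # torch.Size([2])
```

The wrapper would be what gets handed to the policy in place of the bare model, so the model code stays free of shape handling. Whether the rollout itself can be made to add the dimension is a separate question that this sketch does not answer.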