I am implementing an actor-critic algorithm (MP-DQN). In a toy environment it works great, and the policy network produces values between -1 and 1. But when I use the same network with the same hyperparameters in my own environment, the raw outputs of the policy network keep drifting further and further up (or down) over training. After tanh squashing, the parameters get stuck at 1 or -1 and no longer depend on the actual state, so the agent only has, for example, the option to turn full left or full right.

What might be the problem? I am new to this field and don't know what to look for. What should I check, and what should I adjust? Would hyperparameter optimisation help, or does it have to do with the input data? The inputs are normalized, some between -1 and 1 and some between 0 and 1.
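To make the symptom concrete, here is the kind of check I have been doing: measuring what fraction of the squashed actions are pinned near -1 or +1. This is just my own hypothetical diagnostic sketch (the helper name `saturation_fraction` and the threshold are my own, not from any MP-DQN implementation):

```python
import numpy as np

def saturation_fraction(pre_tanh, threshold=0.99):
    """Fraction of tanh-squashed outputs pinned near -1 or +1.

    pre_tanh: raw policy-network outputs before the tanh squashing.
    A fraction near 1.0 means the outputs have saturated and the
    actions barely depend on the state anymore.
    """
    actions = np.tanh(pre_tanh)
    return float(np.mean(np.abs(actions) > threshold))

# Small pre-tanh values stay unsaturated; drifting ones saturate.
healthy = np.random.randn(1000) * 0.5   # |tanh(x)| mostly well below 0.99
drifted = np.random.randn(1000) + 20.0  # outputs that kept growing
print(saturation_fraction(healthy))  # close to 0.0
print(saturation_fraction(drifted))  # close to 1.0
```

In my case this fraction climbs toward 1.0 during training in my own environment, while it stays low in the toy environment.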
Thanks for your help!