Should the action log-probability be computed before or after constraining the action?

Suppose we implement a Gaussian policy and want to constrain the sampled action to an upper/lower bound (e.g. [-2, 2], by applying 2*torch.tanh(action)). This raises a question: should we compute the log-probability before or after applying the action constraint?
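For concreteness, here is a minimal sketch of the two placements (the mean/std values are just placeholders, not from any particular policy network):

```python
import torch
from torch.distributions import Normal

# Hypothetical policy outputs, for illustration only
mean = torch.tensor([0.5])
std = torch.tensor([1.0])
dist = Normal(mean, std)

action = dist.rsample()  # raw Gaussian sample

# Option A: log-prob of the raw sample; the constraint is applied only
# to the action that gets executed in the environment
log_prob_before = dist.log_prob(action)
executed = torch.clamp(action, -2.0, 2.0)

# Option B: constrain first, then evaluate the Gaussian density at the
# constrained value -- this density no longer matches the distribution
# the constrained action was actually drawn from
squashed = 2.0 * torch.tanh(action)
log_prob_after = dist.log_prob(squashed)
```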

If you clip after sampling, the distribution is no longer Gaussian: the clipped probability mass piles up at the bound closest to the mean.
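A quick numerical check of that pile-up effect (the mean/std values here are arbitrary, chosen only to put the mean near the upper bound):

```python
import torch

mean, std, bound = 1.5, 1.0, 2.0
samples = torch.normal(mean, std, size=(100_000,))
clipped = samples.clamp(-bound, bound)

# All samples above the bound collapse onto the bound itself,
# creating a spike of probability mass there.
frac_at_upper = (clipped == bound).float().mean()
print(f"fraction of mass at +2: {frac_at_upper:.3f}")  # roughly 0.31 for these values
```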

If you sample and then apply tanh, the distribution is even less Gaussian, and it is distorted too much to estimate the gradient of the return reliably.
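That said, if you do want a valid log-probability under the tanh squashing, you can correct for the change of variables (this is what SAC-style implementations do). A minimal sketch, assuming the 2*tanh(action) bound from the question:

```python
import torch
from torch.distributions import Normal

# Hypothetical mean/std, for illustration only
dist = Normal(torch.zeros(1), torch.ones(1))

u = dist.rsample()        # unconstrained Gaussian sample
a = 2.0 * torch.tanh(u)   # squashed into [-2, 2]

# Change-of-variables correction:
# log p(a) = log p(u) - log|da/du|, with da/du = 2 * (1 - tanh(u)^2)
log_prob = dist.log_prob(u) - torch.log(2.0 * (1.0 - torch.tanh(u).pow(2)) + 1e-6)
# (for multi-dimensional actions, sum both terms over the action dimension)
```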

I would suggest the first solution (clipping), but there are cleaner tricks: when you sample outside the limit, you wrap back around (like in Nokia's Snake game), and the resulting distribution stays closer to a Gaussian.
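One possible reading of that trick, sketched under the assumption that "wrapping back around" means a sample leaving the interval re-enters from the opposite side (as in Snake's screen wrap), rather than piling up at the nearest bound:

```python
import torch

low, high = -2.0, 2.0
width = high - low

def wrap_into_bounds(u: torch.Tensor) -> torch.Tensor:
    # Samples that fall outside [low, high) re-enter from the other side,
    # instead of being clipped onto the nearest bound.
    return (u - low) % width + low

u = torch.normal(0.0, 1.0, size=(5,))  # raw Gaussian samples
a = wrap_into_bounds(u)                 # actions guaranteed to lie in [-2, 2)
```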