Hi,

I am building my first REINFORCE (policy gradient) model with a continuous action space between 0 and 1. Right now, I use the following code for this:

```
# assumes: import torch.nn as nn
#          from torch.distributions import Normal
self.loc_layer = nn.Sequential(
    nn.Linear(size, n_params),
    nn.Sigmoid()  # keeps the mean in (0, 1)
)
self.scale_layer = nn.Sequential(
    nn.Linear(size, n_params),
    nn.Sigmoid()  # keeps the std in (0, 1)
)
# ...
loc = self.loc_layer(x)
if self.training:
    scale = self.scale_layer(x)
    dist = Normal(loc=loc, scale=scale)
    params = dist.sample()
    log_prob = dist.log_prob(params).sum(dim=1)
    # clamp because a Normal sample can fall outside [0, 1]
    params = params.clamp(min=0, max=1)
else:
    params = loc
    log_prob = None
```

However, this does not seem right to me. For example, I do not like that I have to clamp to ensure that the actions (`params`) are between 0 and 1. Therefore, I am considering a Beta or Logit-Normal distribution instead. In other examples, I have seen that Softplus is often used as the activation function for the scale argument, but that does not seem right here because I want to limit the action space.

Can anyone with experience in this area recommend which activation function and distribution to use in this case?