I am working on an actor-critic method where the number of possible actions is infinite (say, the movement of a robot that can move not just left, right, up, or down, but at any angle).
My obvious idea was to have the policy output mu and sigma^2:
```python
import torch.nn as nn
import torch.nn.functional as F


class Policy(nn.Module):
    """Implements both actor and critic in one model."""

    def __init__(self):
        super(Policy, self).__init__()
        # state_size is defined elsewhere
        self.fc1 = nn.Linear(state_size + 1, 128)
        self.fc2 = nn.Linear(128, 64)

        # actor's layer
        self.action_head = nn.Linear(64, 1)
        self.mu = nn.Sigmoid()
        self.var = nn.Softplus()

        # critic's layer
        self.value_head = nn.Linear(64, 1)

    def forward(self, x):
        """Forward pass of both actor and critic."""
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        action_prob = self.action_head(x)
        mu = self.mu(action_prob)
        var = self.var(action_prob)

        state_values = self.value_head(x)

        return mu, var, state_values
```
Now I also need to sample an action from this (mu, sigma**2 = var) pair, and to compute the probability density of that action under the normal distribution. This is what I am doing:
```python
import torch
from scipy import stats

sigma = torch.sqrt(var)
action = torch.normal(mu, sigma)
action = torch.clip(action, 0, 1)
pdf_probability = stats.norm.pdf(
    action.cpu().detach().numpy(),
    loc=mu.cpu().detach().numpy(),
    scale=sigma.cpu().detach().numpy(),
)
```
Now I have some questions. Is what I am doing OK? It does not feel right: I assume gradients should flow through pdf_probability during backpropagation, but the Tensor -> NumPy -> Tensor conversion detaches it from the computation graph.
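To make the concern concrete, here is a minimal sketch of how I think both steps could stay inside autograd, assuming `torch.distributions.Normal` is used in place of `scipy.stats`. The `mu` and `var` tensors are stand-ins for the network outputs, and `advantage` is a hypothetical placeholder, so this is an illustration rather than a confirmed fix.

```python
import torch
from torch.distributions import Normal

torch.manual_seed(0)

# Stand-ins for the policy outputs (assumed values; in practice they come from forward()).
mu = torch.tensor([0.4], requires_grad=True)
var = torch.tensor([0.05], requires_grad=True)

dist = Normal(mu, torch.sqrt(var))   # Normal takes the standard deviation, not the variance
action = dist.sample()               # sample() does not track gradients through the draw
log_prob = dist.log_prob(action)     # differentiable with respect to mu and var

# Clip only the value sent to the environment; log_prob uses the unclipped sample.
env_action = torch.clamp(action, 0.0, 1.0)

advantage = torch.tensor([1.0])      # hypothetical placeholder for (return - value)
policy_loss = -(log_prob * advantage).sum()
policy_loss.backward()               # populates mu.grad and var.grad
```

Working with the log-probability instead of the raw density is the usual choice for policy-gradient losses, and it never leaves the tensor world, so nothing is detached along the way.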
I have gone through the PyTorch Forums thread "[resolved] Actor Critic with a large amount of possible actions", which discusses a similar issue. They also compute mu and sigma^2 from the policy, but they never explain how to compute the action from (mu, sigma^2).
I also read the A3C paper, "Asynchronous Methods for Deep Reinforcement Learning" (arxiv.org), which states that mu should be computed by a linear layer. Is that mandatory? In my case I only want to consider positive angles, so mu should be followed by a Sigmoid or ReLU, right?
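As an illustration of the bounded-mean option, here is a rough sketch in which the mean and variance come from separate linear layers, the mean is squashed with a sigmoid and scaled to an angle range, and the variance goes through a softplus. The class name `GaussianActorHead`, the `max_angle` parameter, and the small `1e-5` variance floor are all hypothetical choices of mine, not something taken from the A3C paper.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F


class GaussianActorHead(nn.Module):
    """Hypothetical head: bounded mean and positive variance from separate linear layers."""

    def __init__(self, in_features=64, max_angle=2 * math.pi):
        super().__init__()
        self.mu_layer = nn.Linear(in_features, 1)
        self.var_layer = nn.Linear(in_features, 1)
        self.max_angle = max_angle

    def forward(self, features):
        # Sigmoid maps to [0, 1]; scaling gives a mean in [0, max_angle].
        mu = torch.sigmoid(self.mu_layer(features)) * self.max_angle
        # Softplus keeps the variance positive; the floor avoids a zero variance.
        var = F.softplus(self.var_layer(features)) + 1e-5
        return mu, var


torch.manual_seed(0)
head = GaussianActorHead()
mu, var = head(torch.randn(8, 64))  # batch of 8 feature vectors of size 64
```

Using two layers means the mean and the variance are no longer tied to the same pre-activation, which is one difference from my `Policy` class above, where both are derived from a single `action_head` output.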