I am trying to build a model for a policy and have a general question regarding the initial values of the policy.
It looks to me as though, from the very first forward pass, the softmax does not return equal probabilities for the actions.
- How can I force the policy to assign equal probability to all actions at the first step?
- What would be a good method to increase exploration when training the policy?
- What are some ways to apply constraints on the actions? For example, if a specific state makes one of the actions impossible, how can that action be ruled out, besides penalizing it through the reward?
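
For the last point, one commonly used option is to mask the logits before they are turned into probabilities, so that impossible actions get probability zero and are never sampled. The following is only a minimal sketch under stated assumptions: `masked_categorical` and `valid_mask` are hypothetical names, the mask is assumed to come from the environment, and every state is assumed to have at least one valid action.

```python
import torch
from torch.distributions import Categorical


def masked_categorical(logits: torch.Tensor, valid_mask: torch.Tensor) -> Categorical:
    """Build a Categorical that gives zero probability to invalid actions.

    logits:     (batch, action_size) raw scores, before any softmax.
    valid_mask: (batch, action_size) bool tensor, True where the action is allowed.
    Assumption: every row contains at least one True entry.
    """
    # Setting the logit of an invalid action to -inf makes its softmax probability 0.
    masked_logits = logits.masked_fill(~valid_mask, float("-inf"))
    return Categorical(logits=masked_logits)


# Example: three actions, the middle one is impossible in this state.
logits = torch.zeros(1, 3)
valid_mask = torch.tensor([[True, False, True]])
dist = masked_categorical(logits, valid_mask)
action = dist.sample()            # only ever 0 or 2
log_prob = dist.log_prob(action)  # finite, since the sampled action is valid
```

To use this with a network like the one in this post, the masking would have to be applied to the output of the last linear layer, before the softmax, rather than to the probabilities.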
```python
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class Policy(nn.Module):
    def __init__(self, state_size, action_size, fc1_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = self.fc2(x)
        x = F.softmax(x, dim=1)  # action probabilities for each row of the batch
        return x

    def act(self, state):
        state = state.unsqueeze(0)  # add a batch dimension
        probs = self.forward(state).cpu()
        p = Categorical(probs)
        action = p.sample()
        log_prob = p.log_prob(action)
        return action.item(), log_prob
```
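
For the first two points, here is a minimal sketch of two approaches that are often suggested, not a definitive answer. Zeroing the weights and bias of the output layer makes every logit 0 at the start, so the softmax is exactly uniform. Adding an entropy bonus to the loss rewards the policy for staying spread out. `UniformStartPolicy`, `policy_loss_with_entropy_bonus`, and the coefficient `beta=0.01` are hypothetical names and values chosen for illustration.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class UniformStartPolicy(nn.Module):
    """Hypothetical variant of the Policy above whose initial output is uniform."""

    def __init__(self, state_size, action_size, fc1_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, action_size)
        # Zero the output layer: all logits start at 0, so softmax is uniform.
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=1)


def policy_loss_with_entropy_bonus(probs, actions, returns, beta=0.01):
    """REINFORCE-style loss minus beta times the mean policy entropy.

    beta is an assumed coefficient; larger values push toward more exploration.
    """
    dist = Categorical(probs)
    pg_loss = -(dist.log_prob(actions) * returns).mean()
    return pg_loss - beta * dist.entropy().mean()


policy = UniformStartPolicy(state_size=4, action_size=3)
probs = policy(torch.randn(5, 4))  # every row is [1/3, 1/3, 1/3] before training
```

Only the last layer is zeroed here. `fc1` keeps its default random initialization, so the hidden units still differ from each other, and `fc2` moves away from zero after the first gradient update.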