I am trying to build a model for a policy and have a general question about the policy's initial values.
It looks to me as though, from the very first forward pass, the softmax does not return equal probabilities.
Questions:
- How can I force the policy to assign equal probability to all actions at the first step?
- What would be a good method to increase exploration when training the policy?
- What are some ways to apply constraints on the actions? For example, if a specific state makes one of the actions impossible, how can that action be ruled out other than by penalizing it through the reward?
example policy:
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class Policy(nn.Module):
    def __init__(self, state_size, action_size, fc1_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = self.fc2(x)
        # Softmax over the action dimension gives action probabilities.
        x = F.softmax(x, dim=1)
        return x

    def act(self, state):
        # Add a batch dimension: (state_size,) -> (1, state_size).
        state = state.unsqueeze(0)
        probs = self.forward(state).cpu()
        p = Categorical(probs)
        action = p.sample()
        log_prob = p.log_prob(action)
        return action.item(), log_prob
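
To make the three questions concrete, here is a rough sketch of one possible approach, not a definitive answer. It assumes that zeroing the output layer is an acceptable way to get a uniform starting policy, that impossible actions can be described by a boolean mask, and that an entropy bonus is the exploration method of interest. The class name `MaskedPolicy`, the `mask` argument, and the 0.01 coefficient are all hypothetical and not part of the policy above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class MaskedPolicy(nn.Module):  # hypothetical name, for illustration only
    def __init__(self, state_size, action_size, fc1_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, action_size)
        # Zero the output layer so every logit starts at 0,
        # which makes the initial softmax exactly uniform.
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, state, mask=None):
        logits = self.fc2(F.relu(self.fc1(state)))
        if mask is not None:
            # mask is a bool tensor, True where the action is allowed.
            # A logit of -inf gives that action probability 0.
            logits = logits.masked_fill(~mask, float("-inf"))
        return Categorical(logits=logits)


policy = MaskedPolicy(state_size=4, action_size=3)
state = torch.randn(1, 4)

# Unmasked: all three actions start with probability 1/3.
dist = policy(state)

# Masked: action 1 is declared impossible in this state.
mask = torch.tensor([[True, False, True]])
masked = policy(state, mask)
action = masked.sample()
log_prob = masked.log_prob(action)

# Entropy bonus (assumed coefficient) that could be added to the objective
# to discourage the policy from collapsing onto one action too early.
entropy_bonus = 0.01 * dist.entropy().mean()
```

With a zeroed output layer, the first gradient step only updates `fc2`, since no gradient flows back to `fc1` until `fc2.weight` is nonzero. At least one action must stay unmasked in every state, otherwise the distribution is undefined.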