Initializing policy - reinforcement learning

youdar · September 14, 2020, 1:32am

I am trying to build a model for a policy and have a general question regarding the initial values of the policy.
It looks to me as though, from the very first forward pass, the softmax does not return equal probabilities.
Questions:

How can I force the policy to assign equal probability to all actions at the first step?
What would be a good method to increase exploration when training the policy?
What are some ways to apply constraints on the actions? For example, if a specific state makes one of the actions impossible, is there an option besides encoding it in the rewards as a penalty (restraining the actions)?

Example policy:

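import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical


class Policy(nn.Module):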
    def __init__(self, state_size, action_size, fc1_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, action_size)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = self.fc2(x)
        x = F.softmax(x, dim=1)
        return x

    def act(self, state):
        state = state.unsqueeze(0)
        probs = self.forward(state).cpu()
        p = Categorical(probs)
        action = p.sample()
        log_prob = p.log_prob(action)
        return action.item(), log_prob
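
On the first question, one possible approach (a sketch, not the only way) is to zero-initialize the last linear layer. With zero weights and zero bias every logit is 0 for any input, so the softmax returns 1 / action_size for each action. The class name `UniformInitPolicy` and the sizes used at the bottom are made up for illustration; the layers mirror the `Policy` above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class UniformInitPolicy(nn.Module):
    """Hypothetical variant of Policy whose output layer starts at zero."""

    def __init__(self, state_size, action_size, fc1_size=64):
        super().__init__()
        self.fc1 = nn.Linear(state_size, fc1_size)
        self.fc2 = nn.Linear(fc1_size, action_size)
        # Zero weights and bias make every logit 0 regardless of the state,
        # so the initial softmax output is uniform over the actions.
        nn.init.zeros_(self.fc2.weight)
        nn.init.zeros_(self.fc2.bias)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        return F.softmax(self.fc2(x), dim=1)


# Illustrative sizes only: 4 state features, 3 actions, batch of 5 states.
policy = UniformInitPolicy(state_size=4, action_size=3)
probs = policy(torch.randn(5, 4))
```

One thing to keep in mind with this choice: on the very first update `fc1` receives no gradient, because the gradient flows back through the zero weights of `fc2`. `fc2` itself is still updated, so later updates reach `fc1` as usual. Initializing `fc2` with very small random weights instead would give an approximately uniform policy without that effect.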

AdilZouitine · December 9, 2020, 12:49pm

Good question. I am also trying to do the same thing, but I have not been able to.
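
For the exploration and constraint questions in the original post, two commonly used techniques are an entropy bonus in the loss and masking the logits of impossible actions before sampling. The sketch below assumes the network returns raw logits (the output of `fc2` before the softmax) and that the environment can supply a boolean mask of allowed actions. The function names, the `entropy_coef` value, and the placeholder returns are all assumptions for illustration, not part of the original code.

```python
import torch
from torch.distributions import Categorical


def masked_distribution(logits, valid_mask):
    """Build a Categorical that puts zero probability on invalid actions.

    logits: (batch, n_actions) raw scores, before softmax.
    valid_mask: bool tensor of the same shape, True where the action is allowed.
    """
    # A very large negative logit underflows to probability 0 after softmax,
    # so invalid actions are never sampled.
    masked_logits = logits.masked_fill(~valid_mask, -1e9)
    return Categorical(logits=masked_logits)


def policy_loss_with_entropy(dist, actions, returns, entropy_coef=0.01):
    """REINFORCE-style loss minus an entropy bonus that rewards exploration."""
    log_probs = dist.log_prob(actions)
    pg_loss = -(log_probs * returns).mean()
    entropy = dist.entropy().mean()
    # Subtracting the entropy term penalizes overly peaked distributions.
    return pg_loss - entropy_coef * entropy


# Illustrative data: 4 states, 3 actions, action 2 is impossible everywhere.
torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)
valid_mask = torch.tensor([[True, True, False]] * 4)

dist = masked_distribution(logits, valid_mask)
actions = dist.sample()
returns = torch.ones(4)  # placeholder returns, not real rollout data
loss = policy_loss_with_entropy(dist, actions, returns)
loss.backward()
```

Masking at the logit level keeps the constraint out of the reward function entirely, and the gradient with respect to the masked logits is zero, so the network is not pushed around by actions it could never take. The entropy coefficient usually needs tuning per problem.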