I’m a student working on a PPO implementation and trying to understand how entropy is computed in the following examples. I have a basic understanding of what entropy is after watching Aurélien Géron’s video. (Sorry, I’m a newb and not allowed to overuse links.)
In shangtongzhang’s implementation of PPO,
```python
def __init__(self, state_dim, action_dim, phi_body=None, actor_body=None, critic_body=None):
    super(GaussianActorCriticNet, self).__init__()
    self.network = ActorCriticNet(state_dim, action_dim, phi_body, actor_body, critic_body)
    self.std = nn.Parameter(torch.zeros(action_dim))
    self.to(Config.DEVICE)

def forward(self, obs, action=None):
    obs = tensor(obs)
    phi = self.network.phi_body(obs)
    phi_a = self.network.actor_body(phi)
    phi_v = self.network.critic_body(phi)
    mean = F.tanh(self.network.fc_action(phi_a))
    v = self.network.fc_critic(phi_v)
    dist = torch.distributions.Normal(mean, F.softplus(self.std))
    if action is None:
        action = dist.sample()
    log_prob = dist.log_prob(action).sum(-1).unsqueeze(-1)
    entropy = dist.entropy().sum(-1).unsqueeze(-1)
```
The spread of the distribution is driven by the line `self.std = nn.Parameter(torch.zeros(action_dim))`, where the softplus function is applied to this tensor to get the standard deviation of the Normal distribution. Isn’t this just a bunch of zeros that never change?
This suggests that entropy is a fixed property, basically specified by the creator of the network. If so, I’m struggling to see the point of adding an entropy term to the loss. I’ve seen implementations that use a fixed parameter in place of the zeros. This makes more sense to me if entropy is basically a specified and static quantity.
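One way to test the "zeros that never change" assumption is a minimal sketch like the one below. It is not taken from either implementation: `action_dim = 2`, the zero mean, the SGD optimizer, the learning rate, and the helper `policy_entropy` are all hypothetical choices for illustration. It wraps zeros in `nn.Parameter`, takes one gradient step on the negative entropy, and compares the entropy before and after.

```python
import math

import torch
import torch.nn as nn
import torch.nn.functional as F

action_dim = 2  # hypothetical action dimension, for illustration only

# Same construction as in the first snippet: zeros wrapped in nn.Parameter.
std_param = nn.Parameter(torch.zeros(action_dim))
mean = torch.zeros(action_dim)  # stand-in for the actor's mean output


def policy_entropy(raw_std):
    """Entropy of a diagonal Gaussian whose std is softplus(raw_std)."""
    dist = torch.distributions.Normal(mean, F.softplus(raw_std))
    return dist.entropy().sum(-1)


entropy_before = policy_entropy(std_param).item()

# One plain SGD step that maximizes entropy (minimizes its negative).
optimizer = torch.optim.SGD([std_param], lr=0.1)
loss = -policy_entropy(std_param)
optimizer.zero_grad()
loss.backward()
optimizer.step()

entropy_after = policy_entropy(std_param).item()

print("requires_grad:", std_param.requires_grad)
print("std parameter after one step:", std_param.data)
print("entropy before / after:", entropy_before, entropy_after)
```

If the parameter is registered with an optimizer in this way, the zeros act only as an initial value. For a Normal distribution the per-dimension entropy is 0.5 * log(2 * pi * e) + log(std), so it depends only on the standard deviation and shifts whenever that parameter is updated.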
In this implementation:
```python
    self.fc_actor_mean = nn.Linear(256, self.action_dim)
    self.fc_actor_std = nn.Linear(256, self.action_dim)
    self.fc_critic = nn.Linear(256, 1)
    self.std = nn.Parameter(torch.zeros(1, action_dim))

def forward(self, x, action=None):
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))

    # Actor
    mean = torch.tanh(self.fc_actor_mean(x))
    std = F.softplus(self.fc_actor_std(x))
    dist = torch.distributions.Normal(mean, std)
    if action is None:
        action = dist.sample()
    log_prob = dist.log_prob(action)

    # Critic
    # State value V(s)
    v = self.fc_critic(x)
```
the standard deviation of the distribution is taken from one of the heads of the network rather than from a bunch of zeros. This is conceptually more intuitive, in the sense that the distribution, and therefore the entropy, is dynamic and a property of the network and its parameters. Is this a valid way of doing this and, if so, how are the outputs of this head trained so that they become meaningful?
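To make the second variant concrete, here is a minimal sketch of how a state-dependent std head could receive a training signal. Everything beyond the layer names is an assumption for illustration: the sizes (`hidden_dim = 8`, `action_dim = 2`), the random features `x`, the constant `advantage`, the `entropy_coef` value, and the plain policy-gradient surrogate, which stands in for the PPO clipped objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

hidden_dim, action_dim = 8, 2  # hypothetical sizes, for illustration only
fc_actor_mean = nn.Linear(hidden_dim, action_dim)
fc_actor_std = nn.Linear(hidden_dim, action_dim)

# Stand-in for the shared features of a batch of 4 states.
x = torch.randn(4, hidden_dim)

mean = torch.tanh(fc_actor_mean(x))
std = F.softplus(fc_actor_std(x))  # positive, and different for each state
dist = torch.distributions.Normal(mean, std)

action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)  # joint log-prob over action dims
entropy = dist.entropy().sum(-1)          # one entropy value per state

# Hypothetical advantages and entropy coefficient; a plain policy-gradient
# surrogate is used here in place of the PPO clipped objective.
advantage = torch.ones(4)
entropy_coef = 0.01
loss = -(log_prob * advantage).mean() - entropy_coef * entropy.mean()
loss.backward()

print("per-state entropy:", entropy.detach())
print("std head grad norm:", fc_actor_std.weight.grad.norm().item())
```

In this sketch both the log-probability term and the entropy term depend on `std`, so `backward()` populates gradients on the weights of `fc_actor_std`, and the entropy differs from state to state because the head is evaluated on each state's features.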