Understanding Entropy

I’m a student working on a PPO implementation and trying to understand how the entropy is computed in the following examples. I have a basic understanding of what entropy is after watching Aurélien Géron’s video. (Sorry, I’m a newb and not allowed to overuse links.)

In shangtongzhang’s implementation of PPO,

    # excerpt from the GaussianActorCriticNet class
    def __init__(self,
                 state_dim,
                 action_dim,
                 phi_body=None,
                 actor_body=None,
                 critic_body=None):
        super(GaussianActorCriticNet, self).__init__()
        self.network = ActorCriticNet(state_dim, action_dim, phi_body, actor_body, critic_body)
        # learnable, zero-initialized "raw" std, one value per action dimension
        self.std = nn.Parameter(torch.zeros(action_dim))
        self.to(Config.DEVICE)

    def forward(self, obs, action=None):
        obs = tensor(obs)
        phi = self.network.phi_body(obs)
        phi_a = self.network.actor_body(phi)
        phi_v = self.network.critic_body(phi)
        mean = F.tanh(self.network.fc_action(phi_a))
        v = self.network.fc_critic(phi_v)
        dist = torch.distributions.Normal(mean, F.softplus(self.std))
        if action is None:
            action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1).unsqueeze(-1)
        entropy = dist.entropy().sum(-1).unsqueeze(-1)

The distribution is driven by the line self.std = nn.Parameter(torch.zeros(action_dim)), with softplus applied to it to produce the standard deviation of the distribution. Isn’t this just a bunch of zeros that never change?

This suggests that entropy is a fixed property, essentially specified by the creator of the network. If so, I’m struggling to see the point of adding an entropy term to the advantage. I’ve seen implementations that use a parameter in place of the zeros, which would make more sense to me if entropy really is a specified, static quantity.

In this implementation:


        self.fc_actor_mean = nn.Linear(256, self.action_dim)
        self.fc_actor_std = nn.Linear(256, self.action_dim)  # head that outputs a state-dependent (pre-softplus) std
        self.fc_critic = nn.Linear(256, 1)

        self.std = nn.Parameter(torch.zeros(1, action_dim))  # defined here but not used in this forward pass

    def forward(self, x, action=None):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))

        # Actor
        mean = torch.tanh(self.fc_actor_mean(x))
        std = F.softplus(self.fc_actor_std(x))  # std computed from the input, not from a standalone parameter
        dist = torch.distributions.Normal(mean, std)
        if action is None:
            action = dist.sample()
        log_prob = dist.log_prob(action)

        # Critic
        # State value V(s)
        v = self.fc_critic(x)

the standard deviation is produced by one of the heads of the network rather than coming from a bunch of zeros. This is conceptually more intuitive, in the sense that the distribution, and therefore the entropy, is dynamic and a property of the network, its parameters, and the input. Is this a valid way of doing it, and if so, how is the output of this head conditioned to be a meaningful standard deviation?

In this code, self.std would be better named self.logit_std: it is not the standard deviation itself, but the raw value that gets fed through a softplus. Only the output of F.softplus(std) is the actual standard deviation.
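For example, with the zero initialization the actual standard deviation starts at softplus(0) = ln(2) ≈ 0.693, not at zero (a quick check, assuming a recent PyTorch):

    import torch
    import torch.nn.functional as F

    raw_std = torch.zeros(3)       # what nn.Parameter(torch.zeros(action_dim)) starts as
    std = F.softplus(raw_std)      # softplus(0) = ln(2) ≈ 0.6931
    print(std)                     # tensor([0.6931, 0.6931, 0.6931])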

The fact that it is a parameter (nn.Parameter) tells PyTorch that it must be learned, so this value is not fixed: it will adapt along with the rest of the network in order to optimize the PPO objective, which includes both the reward-driven surrogate term and the entropy term (a Lagrangian-style regularizer).
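To make that concrete, here is a minimal sketch (hypothetical variable names, not from either repo) showing that a zero-initialized nn.Parameter receives gradients through the softplus and moves after an optimizer step:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    raw_std = nn.Parameter(torch.zeros(2))    # starts as zeros, but it is learnable
    opt = torch.optim.SGD([raw_std], lr=0.1)

    dist = torch.distributions.Normal(torch.zeros(2), F.softplus(raw_std))
    loss = -dist.entropy().sum()              # toy loss that rewards entropy
    loss.backward()
    opt.step()

    print(raw_std)                            # no longer zeros after one update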

So the entropy, which (as you noted) depends on this variable, will be learned; for a Gaussian the entropy actually depends only on the standard deviation, not on the means. Using such a trick is quite efficient, compared with the second snippet where the standard deviation depends on the input. The standard deviation is, if you like, the trade-off between exploration and exploitation: the smaller the std, the less exploration.
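One detail worth checking for yourself: the mean drops out of a Normal distribution’s entropy entirely, so only the standard deviation matters here (a quick verification, not code from either repo):

    import torch

    std = torch.tensor([0.5, 1.0])
    a = torch.distributions.Normal(torch.zeros(2), std)
    b = torch.distributions.Normal(torch.full((2,), 10.0), std)
    print(a.entropy(), b.entropy())   # identical: 0.5 * log(2 * pi * e * std**2)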

And the entropy term in the loss regulates how quickly the policy commits: its main role is to prevent premature convergence to locally optimal, near-deterministic solutions.
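As a rough sketch of how that term usually enters the PPO objective (the coefficient values and dummy tensors below are illustrative, not taken from either implementation):

    import torch

    # dummy batch just to make the sketch runnable; real values come from rollouts
    log_prob, old_log_prob = torch.randn(8, 1), torch.randn(8, 1)
    advantages, entropy = torch.randn(8, 1), torch.rand(8, 1)
    clip_eps, entropy_coef = 0.2, 0.01

    ratio = torch.exp(log_prob - old_log_prob)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    policy_loss = -torch.min(ratio * advantages, clipped * advantages).mean()

    # subtracting the entropy term rewards higher entropy, which discourages
    # premature collapse to a near-deterministic policy
    loss = policy_loss - entropy_coef * entropy.mean()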

Thank you sincerely! I have many hours invested in trying to grok this one detail.