Confused about Categorical(logits) vs. Categorical(probs): sample() delivers different results

I was trying to implement some RL code which uses “Categorical(probs)” in combination with “softmax” to sample one action (by the way, the environment used is CartPole-v1 from Gymnasium, formerly OpenAI Gym).

But after I changed the reference code in the repository to use “Categorical(logits)” instead of “softmax” + “Categorical(probs)”, I realized that I could not reproduce the same results.

So I implemented both methods to look for the difference.
What I found is that Categorical’s “sample()” method delivers different results when used with logits than when used with probs, even though the input is the same:

If you run the following code in any environment:

def choose_action(self, state):

    state = torch.from_numpy(state).float().to(self.policy_network.device)

    # one forward pass; probs is just the softmax of the same logits
    logits = self.policy_network(state)
    probs = self.softmax(logits)

    policy_dist = Categorical(logits = logits)
    policy_probs = Categorical(probs = probs)

    action = policy_dist.sample()
    action_p = policy_probs.sample()

    log_prob = policy_dist.log_prob(action)
    lgp = policy_probs.log_prob(action_p)

    return action.cpu().numpy(), action_p.cpu().numpy(), log_prob, lgp

where “state” is a torch tensor created with torch.from_numpy, then “action” and “action_p” are different! In the same step you get “action = 0” but “action_p = 1”, which I thought should not be possible, because the action taken in the same step must be the same.

In other words: I would expect action and action_p to contain the same sampled value (“0” or “1”), since they are sampled from two distribution objects built from the same input data.
Or am I wrong?

Thanks

Here is the network policy_network:

import torch.nn.functional as F
import torch.optim as optimizer
import torch.nn as nn
import torch
import os

class ReinforceNetwork(nn.Module):

    def __init__(self, input_dims, layers, n_actions, lr, directory, name, gpu):

        super(ReinforceNetwork, self).__init__()

        self.input_dims = input_dims
        self.layers = layers
        self.n_actions = n_actions
        self.lr = lr
        self.directory = directory
        self.name = name
        self.gpu = gpu

        self.checkpoint_file = os.path.join(self.directory, self.name + "_0" + ".pth")

        self.fc1 = nn.Linear(in_features = self.input_dims, out_features = self.layers[0])
        self.fc2 = nn.Linear(in_features = self.layers[0], out_features = self.layers[1])
        self.mu = nn.Linear(in_features = self.layers[1], out_features = n_actions)

        self.relu = nn.ReLU()

        self.optimizer = optimizer.Adam(params = self.parameters(), lr = self.lr)

        self.device = torch.device(f"cuda:{self.gpu}" if torch.cuda.is_available() else "cpu")
        self.to(self.device)

    def forward(self, state):

        state = self.fc1(state)
        state = F.leaky_relu(state)
        state = self.fc2(state)
        state = F.leaky_relu(state)
        mu = self.mu(state)

        return mu
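
For completeness, this is roughly how I instantiate it (the values below are just placeholders; only the 4 observation features and 2 actions come from CartPole-v1):

policy_network = ReinforceNetwork(
    input_dims = 4,        # CartPole-v1 observation size
    layers = [128, 128],   # example hidden layer sizes
    n_actions = 2,         # CartPole-v1 has two discrete actions
    lr = 0.0005,
    directory = "checkpoints",
    name = "reinforce",
    gpu = 0,
)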

By definition, a sample is a pseudo-random draw from the distribution. Even from the same distribution, two successive samples will usually have different values (except in degenerate cases or by chance, like getting two sixes when throwing a die twice).
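
A minimal sketch to illustrate this, using made-up logits unrelated to your network: both objects hold the same probabilities, i.e. they describe the same distribution, but each call to sample() is an independent draw, so the results can disagree.

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)  # makes the script reproducible, but does not make the two draws equal

logits = torch.tensor([0.5, -1.2])  # made-up logits for illustration

dist_from_logits = Categorical(logits = logits)
dist_from_probs = Categorical(probs = F.softmax(logits, dim = -1))

# Same distribution: the probabilities agree (up to floating-point rounding)
print(dist_from_logits.probs)  # approximately tensor([0.8455, 0.1545])
print(dist_from_probs.probs)   # same values

# But sampling is random: these are two independent draws, so they can differ
print(dist_from_logits.sample(), dist_from_probs.sample())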

Ah, ok!
BUT my code is not wrong, right?
I can sample from a distribution using logits or probs (the latter after the softmax operation), right?

Yes, both options are valid. The rule of thumb is that logits are more numerically stable, but otherwise things should be ok!
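
A small sketch of the numerical-stability point, using made-up extreme logits (nothing to do with your network): going through softmax first can round a tiny probability down to zero in float32, so its log becomes -inf, while staying in log space, which is essentially what the logits parameterization does internally by subtracting the logsumexp, keeps a finite value.

import torch
import torch.nn.functional as F

extreme_logits = torch.tensor([100.0, -100.0])  # made-up extreme values

probs = F.softmax(extreme_logits, dim = -1)
print(probs)             # the small entry underflows to 0.0 in float32
print(torch.log(probs))  # log(0) = -inf for that entry

print(F.log_softmax(extreme_logits, dim = -1))  # finite: approximately [0., -200.]

So for losses built from log-probabilities (like the REINFORCE loss), passing logits directly avoids this loss of precision.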