Confused about Categorical(logits) vs. Categorical(probs): sample() delivers different results

I was trying to implement some RL code which uses “Categorical(probs)” in combination with “softmax” to sample one action (by the way, the environment used is CartPole-v1 from Gymnasium, formerly OpenAI Gym).

But after I changed the reference code in the repository to use “Categorical(logits)” instead of “softmax” + “Categorical(probs)”, I realized that I could not reproduce the same results.

So I implemented both methods to look for the difference.
What I found is that Categorical’s “sample()” method delivers different results when used with logits than when used with probs, even though the input is the same:

If you run the following code in any environment:

def choose_action(self, state):

    state = torch.from_numpy(state).float().to(self.policy_network.device)

    # one forward pass; probs is just the softmax of the same logits
    logits = self.policy_network(state)
    probs = self.softmax(logits)

    policy_dist = Categorical(logits = logits)
    policy_probs = Categorical(probs = probs)

    action = policy_dist.sample()
    action_p = policy_probs.sample()

    log_prob = policy_dist.log_prob(action)
    lgp = policy_probs.log_prob(action_p)

    return action.cpu().numpy(), action_p.cpu().numpy(), log_prob, lgp

where “state” is a torch tensor created with torch.from_numpy, then “action” and “action_p” are different! In the same step you get “action = 0” but “action_p = 1”, which I thought should not be possible, because the action taken in the same step must be the same.

In other words: I would expect action and action_p to contain the same sampled value (“0” or “1”), since they are sampled from two distribution objects built from the same input data.
Or am I wrong?

Thanks

Here is the network policy_network:

import torch.nn.functional as F
import torch.optim as optimizer
import torch.nn as nn
import torch
import os

class ReinforceNetwork(nn.Module):

    def __init__(self, input_dims, layers, n_actions, lr, directory, name, gpu):

        super(ReinforceNetwork, self).__init__()

        self.input_dims = input_dims
        self.layers = layers
        self.n_actions = n_actions
        self.lr = lr
        self.directory = directory
        self.name = name
        self.gpu = gpu

        self.checkpoint_file = os.path.join(self.directory, self.name + "_0" + ".pth")

        self.fc1 = nn.Linear(in_features = self.input_dims, out_features = self.layers[0])
        self.fc2 = nn.Linear(in_features = self.layers[0], out_features = self.layers[1])
        self.mu = nn.Linear(in_features = self.layers[1], out_features = n_actions)

        self.relu = nn.ReLU()

        self.optimizer = optimizer.Adam(params = self.parameters(), lr = self.lr)

        self.device = torch.device(f"cuda:{self.gpu}" if torch.cuda.is_available() else "cpu")
        self.to(self.device)

    def forward(self, state):

        state = self.fc1(state)
        state = F.leaky_relu(state)
        state = self.fc2(state)
        state = F.leaky_relu(state)
        mu = self.mu(state)

        return mu
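
For completeness, this is roughly how I instantiate it (the values below are just placeholders; only the 4 observation features and 2 actions come from CartPole-v1):

policy_network = ReinforceNetwork(
    input_dims = 4,        # CartPole-v1 observation size
    layers = [128, 128],   # example hidden layer sizes
    n_actions = 2,         # CartPole-v1 has two discrete actions
    lr = 0.0005,
    directory = "checkpoints",
    name = "reinforce",
    gpu = 0,
)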

By definition, a sample is a pseudo-random draw from the distribution. Even from the same distribution, two successive samples will usually have different values (except in degenerate cases or by chance, like getting two sixes when throwing a die twice).
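
A minimal sketch to illustrate this, using made-up logits unrelated to your network: both objects hold the same probabilities, i.e. they describe the same distribution, but each call to sample() is an independent draw, so the results can disagree.

import torch
import torch.nn.functional as F
from torch.distributions import Categorical

torch.manual_seed(0)  # makes the script reproducible, but does not make the two draws equal

logits = torch.tensor([0.5, -1.2])  # made-up logits for illustration

dist_from_logits = Categorical(logits = logits)
dist_from_probs = Categorical(probs = F.softmax(logits, dim = -1))

# Same distribution: the probabilities agree (up to floating-point rounding)
print(dist_from_logits.probs)  # approximately tensor([0.8455, 0.1545])
print(dist_from_probs.probs)   # same values

# But sampling is random: these are two independent draws, so they can differ
print(dist_from_logits.sample(), dist_from_probs.sample())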

Ah, ok!
BUT my code is not wrong, right?
I can sample from a distribution using logits or probs (the latter after the softmax operation), right?

Yes, both options are valid. The rule of thumb is that logits are more numerically stable, but otherwise things should be ok!
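
A small sketch of the numerical-stability point, using made-up extreme logits (nothing to do with your network): going through softmax first can round a tiny probability down to zero in float32, so its log becomes -inf, while staying in log space, which is essentially what the logits parameterization does internally by subtracting the logsumexp, keeps a finite value.

import torch
import torch.nn.functional as F

extreme_logits = torch.tensor([100.0, -100.0])  # made-up extreme values

probs = F.softmax(extreme_logits, dim = -1)
print(probs)             # the small entry underflows to 0.0 in float32
print(torch.log(probs))  # log(0) = -inf for that entry

print(F.log_softmax(extreme_logits, dim = -1))  # finite: approximately [0., -200.]

So for losses built from log-probabilities (like the REINFORCE loss), passing logits directly avoids this loss of precision.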