I was trying to implement some RL code that uses “Categorical(probs)” in combination with “softmax” to sample a single action (the environment is CartPole-v1 from Gymnasium, formerly OpenAI Gym).
But after I changed the reference code in the repository to use “Categorical(logits)” instead of “softmax” + “Categorical(probs)”, I realized that I could not reproduce the same results.
So I implemented both methods side by side to look for the difference.
What I found is that “sample()” from Categorical returns different results depending on whether the distribution is built from logits or from probs, even though the input is the same:
If you run the following code in any environment:
def choose_action(self, state):
    state = torch.from_numpy(state).float().to(self.policy_network.device)
    logits = self.policy_network(state)
    probs = self.softmax(logits)  # same forward pass reused; self.softmax is e.g. nn.Softmax(dim=-1)
    policy_dist = Categorical(logits=logits)   # parameterized by unnormalized logits
    policy_probs = Categorical(probs=probs)    # parameterized by normalized probabilities
    action = policy_dist.sample()
    action_p = policy_probs.sample()
    log_prob = policy_dist.log_prob(action)
    lgp = policy_probs.log_prob(action_p)
    return action.cpu().numpy(), action_p.cpu().numpy(), log_prob, lgp
where “state” is a tensor created via “torch.from_numpy”, then “action” and “action_p” are different! In the same step you can get “action = 0” but “action_p = 1”, which I thought should not be possible, because both are sampled in the same step from the same network output.
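A minimal standalone sketch of the same comparison, with hand-picked logits instead of the network output (the values here are just for illustration):

import torch
from torch.distributions import Categorical

logits = torch.tensor([0.2, -0.1])               # hypothetical 2-action logits
probs = torch.softmax(logits, dim=-1)

dist_logits = Categorical(logits=logits)
dist_probs = Categorical(probs=probs)

# both parameterizations describe the same distribution
print(torch.allclose(dist_logits.probs, dist_probs.probs))   # True

# two independent sample() calls on these distributions
print(dist_logits.sample(), dist_probs.sample())             # can disagree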
In other words: I would expect “action” and “action_p” to contain the same sampled value (“0” or “1”), since they are sampled from two distribution objects that were built from the same input data.
Or am I wrong?
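To isolate whether the discrepancy comes from the distributions themselves or only from the random draws, one test I can think of is rewinding the global RNG state between the two calls (a sketch, not part of the original repo):

import torch
from torch.distributions import Categorical

logits = torch.tensor([0.2, -0.1])   # hypothetical logits
dist_logits = Categorical(logits=logits)
dist_probs = Categorical(probs=torch.softmax(logits, dim=-1))

rng_state = torch.get_rng_state()    # save the global CPU RNG state
a1 = dist_logits.sample()
torch.set_rng_state(rng_state)       # rewind before the second draw
a2 = dist_probs.sample()
print(a1, a2)                        # both draws now consume the same random numbers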
Thanks
Here is the network policy_network:
import torch.nn.functional as F
import torch.optim as optimizer
import torch.nn as nn
import torch
import os

class ReinforceNetwork(nn.Module):
    def __init__(self, input_dims, layers, n_actions, lr, directory, name, gpu):
        super(ReinforceNetwork, self).__init__()
        self.input_dims = input_dims
        self.layers = layers
        self.n_actions = n_actions
        self.lr = lr
        self.directory = directory
        self.name = name
        self.gpu = gpu
        self.checkpoint_file = os.path.join(self.directory, self.name + "_0" + ".pth")
        self.fc1 = nn.Linear(in_features=self.input_dims, out_features=self.layers[0])
        self.fc2 = nn.Linear(in_features=self.layers[0], out_features=self.layers[1])
        self.mu = nn.Linear(in_features=self.layers[1], out_features=n_actions)
        self.relu = nn.ReLU()
        self.optimizer = optimizer.Adam(params=self.parameters(), lr=self.lr)
        self.device = torch.device(f"cuda:{self.gpu}" if torch.cuda.is_available() else "cpu")
        self.to(self.device)

    def forward(self, state):
        state = self.fc1(state)
        state = F.leaky_relu(state)
        state = self.fc2(state)
        state = F.leaky_relu(state)
        mu = self.mu(state)  # raw logits, no activation on the output layer
        return mu
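For completeness, this is roughly how the network is constructed for CartPole-v1 (the hyperparameters here are placeholders, not the exact values from the repo):

# CartPole-v1: 4-dimensional observation, 2 discrete actions
policy_network = ReinforceNetwork(input_dims=4, layers=[128, 128], n_actions=2,
                                  lr=1e-3, directory="checkpoints",
                                  name="reinforce", gpu=0)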