I’m trying to solve the CartPole-v1 problem from the book “Introduction into RL” by adapting the code posted in its repository here: link.

Their code is old, so I updated it to work with Gymnasium 0.28.1, and I replaced this line of code:

    action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy()) #E

with this code:

    policy_dist = Categorical(probs=act_prob)
    action = policy_dist.sample()
    log_prob = policy_dist.log_prob(action)
    action = action.cpu().numpy()

because I want to sample the action from the Categorical distribution instead of doing it with NumPy.
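To make the swap clearer in isolation, here is a minimal standalone sketch of the two sampling approaches (the `act_prob` values are made up for the example):

    import numpy as np
    import torch
    from torch.distributions import Categorical

    # made-up action probabilities, as the policy network would output for a single state
    act_prob = torch.tensor([0.7, 0.3])

    # old approach: sample the action index with NumPy
    action_old = np.random.choice(np.array([0, 1]), p=act_prob.numpy())

    # new approach: sample from a Categorical and keep the log-probability of the sampled action
    policy_dist = Categorical(probs=act_prob)
    action = policy_dist.sample()
    log_prob = policy_dist.log_prob(action)
    action = action.cpu().numpy()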

Here is the full code, which should run as-is once copied onto your system:

    import numpy as np
    import torch
    import gymnasium as gyms
    from torch.distributions import Categorical

    env = gyms.make("CartPole-v1")

    # policy network: 4 state features -> 2 action probabilities
    l1 = 4
    l2 = 150
    l3 = 2

    model = torch.nn.Sequential(
        torch.nn.Linear(l1, l2),
        torch.nn.LeakyReLU(),
        torch.nn.Linear(l2, l3),
        torch.nn.Softmax(dim=0)
    )

    learning_rate = 0.009
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    def discount_rewards(rewards, gamma=0.99):
        lenr = len(rewards)
        # gamma^t * reward, normalised by the maximum
        disc_return = torch.pow(gamma, torch.arange(lenr).float()) * rewards
        disc_return /= disc_return.max()
        return disc_return

    def loss_fn(preds, r):
        return -1 * torch.sum(r * torch.log(preds))

    MAX_DUR = 499
    MAX_EPISODES = 500
    gamma = 0.99
    scores = []
    expectation = 0.0

    for episode in range(MAX_EPISODES):
        curr_state, info = env.reset()
        termination = False
        truncation = False
        transitions = []
        score = 0

        for t in range(MAX_DUR):
            act_prob = model(torch.from_numpy(curr_state).float())
            policy_dist = Categorical(probs=act_prob)
            action = policy_dist.sample()
            log_prob = policy_dist.log_prob(action)
            action = action.cpu().numpy()
            prev_state = curr_state
            curr_state, reward, termination, truncation, info = env.step(action)
            score += reward
            transitions.append((prev_state, action, log_prob, t + 1))
            if termination or truncation:
                break

        optimizer.zero_grad()
        reward_batch = torch.Tensor([r for (s, a, lp, r) in transitions]).flip(dims=(0,))
        disc_returns = discount_rewards(reward_batch)
        state_batch = torch.stack([torch.from_numpy(s) for (s, a, lp, r) in transitions])
        action_batch = torch.stack([torch.from_numpy(a) for (s, a, lp, r) in transitions])
        log_prob = torch.stack([lp for (s, a, lp, r) in transitions])
        pred_batch = model(state_batch)
        prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1, 1)).squeeze()
        # Here is the comparison
        loss = loss_fn(prob_batch, disc_returns)
        new_loss = -1 * torch.sum(disc_returns * log_prob)
        loss.backward()
        optimizer.step()
        scores.append(score)
        avg_score = np.mean(scores[-100:])
        print(f"Episode: {episode + 1}, avg_score: {avg_score}")
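(As a side note, just to show what `discount_rewards` does, here is a tiny made-up example of mine: for a flipped reward batch of `[3., 2., 1.]` it returns `gamma^t * r` normalised by its maximum.)

    import torch

    def discount_rewards(rewards, gamma=0.99):
        lenr = len(rewards)
        disc_return = torch.pow(gamma, torch.arange(lenr).float()) * rewards
        disc_return /= disc_return.max()
        return disc_return

    print(discount_rewards(torch.Tensor([3., 2., 1.])))
    # tensor([1.0000, 0.6600, 0.3267])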

Now, I didn’t want to use the `torch.log` call here, because the Categorical distribution already gives me the `log_prob` directly. So I’m trying to calculate the loss differently. The original code does:

    loss = -1 * torch.sum(disc_returns * torch.log(probs))

where `probs` is the output of the softmax applied to the output of the network model (`prob_batch` in the code above).

Instead, I’m doing the following:

    new_loss = -1 * torch.sum(disc_returns * log_prob)

where `log_prob` is given directly by the Categorical distribution once the action is sampled:

    log_prob = policy_dist.log_prob(action)

But I don’t get the same result.

I printed the results of the `torch.log(probs)` and `log_prob` operations, and they are quite different:

    tensor([-2.6778, -2.6924, -2.7011, -2.6968, -2.7408, -2.6880, -2.6965, -2.7147, -2.6316, -2.6783, -2.6910, -2.7152, -2.6167, -2.6579, -2.6910], grad_fn=<...>)

    tensor([-0.6593, -0.6677, -0.6946, -0.6593, -0.6892, -0.6697, -0.6991, -0.7334, -0.6097, -0.6514, -0.7062, -0.7417, -0.6036, -0.6406, -0.7169], grad_fn=<...>)

But they should be more or less the same.
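At least for a single state, I would expect the two ways of getting the log-probability to agree exactly; this toy sanity check (probabilities made up by me) prints `True`:

    import torch
    from torch.distributions import Categorical

    probs = torch.tensor([0.7, 0.3])   # toy softmax output for one state
    action = torch.tensor(1)

    log_from_dist = Categorical(probs=probs).log_prob(action)  # log(0.3)
    log_manual = torch.log(probs)[action]                      # log(0.3)
    print(torch.allclose(log_from_dist, log_manual))           # True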

Any idea what’s wrong here?