I’m trying to solve the CartPole-v1 problem from the book “Introduction into RL” by adapting the code posted in its repository here: link.

Their code is old, so I updated it to work with Gymnasium 0.28.1, and I replaced this line of code:

    action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy()) #E

with this code:

    policy_dist = Categorical(probs=act_prob)
    action = policy_dist.sample()
    log_prob = policy_dist.log_prob(action)
    action = action.cpu().numpy()

because I want to sample the action from the Categorical distribution instead of doing it with NumPy.
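To make the swap clearer in isolation, here is a minimal standalone sketch of the two sampling approaches (the `act_prob` values are made up for the example):

    import numpy as np
    import torch
    from torch.distributions import Categorical

    # made-up action probabilities, as the policy network would output for a single state
    act_prob = torch.tensor([0.7, 0.3])

    # old approach: sample the action index with NumPy
    action_old = np.random.choice(np.array([0, 1]), p=act_prob.numpy())

    # new approach: sample from a Categorical and keep the log-probability of the sampled action
    policy_dist = Categorical(probs=act_prob)
    action = policy_dist.sample()
    log_prob = policy_dist.log_prob(action)
    action = action.cpu().numpy()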

Here is the full code, which should run as-is once copied onto your system:

    import numpy as np
    import torch
    import gymnasium as gyms
    from torch.distributions import Categorical

    env = gyms.make("CartPole-v1")

    # policy network: 4 state features -> 2 action probabilities
    l1 = 4
    l2 = 150
    l3 = 2

    model = torch.nn.Sequential(
        torch.nn.Linear(l1, l2),
        torch.nn.LeakyReLU(),
        torch.nn.Linear(l2, l3),
        torch.nn.Softmax(dim=0)
    )

    learning_rate = 0.009
    optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

    def discount_rewards(rewards, gamma=0.99):
        lenr = len(rewards)
        # gamma^t * reward, normalised by the maximum
        disc_return = torch.pow(gamma, torch.arange(lenr).float()) * rewards
        disc_return /= disc_return.max()
        return disc_return

    def loss_fn(preds, r):
        return -1 * torch.sum(r * torch.log(preds))

    MAX_DUR = 499
    MAX_EPISODES = 500
    gamma = 0.99
    scores = []
    expectation = 0.0

    for episode in range(MAX_EPISODES):
        curr_state, info = env.reset()
        termination = False
        truncation = False
        transitions = []
        score = 0

        for t in range(MAX_DUR):
            act_prob = model(torch.from_numpy(curr_state).float())
            policy_dist = Categorical(probs=act_prob)
            action = policy_dist.sample()
            log_prob = policy_dist.log_prob(action)
            action = action.cpu().numpy()
            prev_state = curr_state
            curr_state, reward, termination, truncation, info = env.step(action)
            score += reward
            transitions.append((prev_state, action, log_prob, t + 1))
            if termination or truncation:
                break

        optimizer.zero_grad()
        reward_batch = torch.Tensor([r for (s, a, lp, r) in transitions]).flip(dims=(0,))
        disc_returns = discount_rewards(reward_batch)
        state_batch = torch.stack([torch.from_numpy(s) for (s, a, lp, r) in transitions])
        action_batch = torch.stack([torch.from_numpy(a) for (s, a, lp, r) in transitions])
        log_prob = torch.stack([lp for (s, a, lp, r) in transitions])
        pred_batch = model(state_batch)
        prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1, 1)).squeeze()
        # Here is the comparison
        loss = loss_fn(prob_batch, disc_returns)
        new_loss = -1 * torch.sum(disc_returns * log_prob)
        loss.backward()
        optimizer.step()
        scores.append(score)
        avg_score = np.mean(scores[-100:])
        print(f"Episode: {episode + 1}, avg_score: {avg_score}")
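(As a side note, just to show what `discount_rewards` does, here is a tiny made-up example of mine: for a flipped reward batch of `[3., 2., 1.]` it returns `gamma^t * r` normalised by its maximum.)

    import torch

    def discount_rewards(rewards, gamma=0.99):
        lenr = len(rewards)
        disc_return = torch.pow(gamma, torch.arange(lenr).float()) * rewards
        disc_return /= disc_return.max()
        return disc_return

    print(discount_rewards(torch.Tensor([3., 2., 1.])))
    # tensor([1.0000, 0.6600, 0.3267])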

Now, I didn’t want to use the `torch.log` call here, because the Categorical distribution already gives me the `log_prob` directly. So I’m trying to calculate the loss differently. The original code does:

    loss = -1 * torch.sum(disc_returns * torch.log(probs))

where `probs` is the output of the softmax applied to the output of the network model (`prob_batch` in the code above).

Instead, I’m doing the following:

    new_loss = -1 * torch.sum(disc_returns * log_prob)

where `log_prob` is given directly by the Categorical distribution once the action is sampled:

    log_prob = policy_dist.log_prob(action)

But I don’t get the same result.

I printed the results of the `torch.log(probs)` and `log_prob` operations, and they are quite different:

    tensor([-2.6778, -2.6924, -2.7011, -2.6968, -2.7408, -2.6880, -2.6965, -2.7147, -2.6316, -2.6783, -2.6910, -2.7152, -2.6167, -2.6579, -2.6910], grad_fn=<...>)

    tensor([-0.6593, -0.6677, -0.6946, -0.6593, -0.6892, -0.6697, -0.6991, -0.7334, -0.6097, -0.6514, -0.7062, -0.7417, -0.6036, -0.6406, -0.7169], grad_fn=<...>)

But they should be more or less the same.
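At least for a single state, I would expect the two ways of getting the log-probability to agree exactly; this toy sanity check (probabilities made up by me) prints `True`:

    import torch
    from torch.distributions import Categorical

    probs = torch.tensor([0.7, 0.3])   # toy softmax output for one state
    action = torch.tensor(1)

    log_from_dist = Categorical(probs=probs).log_prob(action)  # log(0.3)
    log_manual = torch.log(probs)[action]                      # log(0.3)
    print(torch.allclose(log_from_dist, log_manual))           # True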

Any idea what’s wrong here?