Struggling with torch.distributions.Categorical: log_prob returns unexpected values

I'm trying to implement the CartPole-v1 problem from the book "Introduction into RL" by adapting the code posted in their repository here: link.

Their code is old, so I adapted it to work with Gymnasium 0.28.1, and I replaced this line of code:

action = np.random.choice(np.array([0,1]), p=act_prob.data.numpy()) #E

with this code:

policy_dist = Categorical(probs = act_prob)
action = policy_dist.sample()
log_prob = policy_dist.log_prob(action)
action = action.cpu().numpy()

because I want to sample the action from the Categorical distribution instead of using NumPy for that.
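
Just to illustrate the intent with a standalone toy example (the probability vector p below is made up and only stands in for act_prob):

import numpy as np
import torch
from torch.distributions import Categorical

p = torch.tensor([0.3, 0.7])                  # stand-in for the network's softmax output

# old approach: sample with NumPy from the detached probabilities
a_np = np.random.choice(np.array([0, 1]), p=p.numpy())

# new approach: sample from the Categorical distribution and keep the log-probability
dist = Categorical(probs=p)
a_torch = dist.sample()               # same sampling distribution as np.random.choice
log_prob = dist.log_prob(a_torch)     # log of the probability of the sampled action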
Here is my full code, which should run once copied into your system.

import numpy as np
import torch
import gymnasium as gyms

from torch.distributions import Categorical

env = gyms.make("CartPole-v1")

l1 = 4
l2 = 150
l3 = 2

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=0)
)

learning_rate = 0.009
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

def discount_rewards(rewards, gamma=0.99):
    lenr = len(rewards)
    # weight each reward by gamma**t and normalize by the maximum
    disc_return = torch.pow(gamma, torch.arange(lenr).float()) * rewards
    disc_return /= disc_return.max()
    return disc_return

def loss_fn(preds, r):
    # REINFORCE loss: negative sum of returns times log-probabilities
    return -1 * torch.sum(r * torch.log(preds))

MAX_DUR = 499
MAX_EPISODES = 500
gamma = 0.99
scores = []
expectation = 0.0

for episode in range(MAX_EPISODES):
    curr_state, info = env.reset()
    termination = False
    truncation = False
    transitions = []
    score = 0

    for t in range(MAX_DUR):
        act_prob = model(torch.from_numpy(curr_state).float())
        policy_dist = Categorical(probs=act_prob)
        action = policy_dist.sample()
        log_prob = policy_dist.log_prob(action)
        action = action.cpu().numpy()

        prev_state = curr_state
        curr_state, reward, termination, truncation, info = env.step(action)
        score += reward
        transitions.append((prev_state, action, log_prob, t+1))

        if termination or truncation:
            break

    optimizer.zero_grad()

    reward_batch = torch.Tensor([r for (s,a,lp,r) in transitions]).flip(dims=(0,))

    disc_returns = discount_rewards(reward_batch)

    state_batch = torch.stack([torch.from_numpy(s) for (s,a,lp,r) in transitions])
    action_batch = torch.stack([torch.from_numpy(a) for (s,a,lp,r) in transitions])
    log_prob = torch.stack([lp for (s,a,lp,r) in transitions])

    pred_batch = model(state_batch)

    prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1,1)).squeeze()

    # Here the comparison
    loss = loss_fn(prob_batch, disc_returns)

    new_loss = -1 * torch.sum(disc_returns * log_prob)

    loss.backward()
    optimizer.step()

    scores.append(score)
    avg_score = np.mean(scores[-100:])

    print(f"Episode: {episode + 1}, avg_score: {avg_score}")

Now, I don't want to use the torch.log call, because the Categorical distribution already provides the log-probability via log_prob. So I'm calculating the loss differently. The original code does:

loss = -1 * torch.sum(disc_returns * torch.log(probs))

where probs is the output of the softmax function applied to the output of the network model.

Instead, I'm doing the following:

new_loss = -1 * torch.sum(disc_returns * log_prob)

where log_prob is given directly by the Categorical distribution once the action is sampled:

log_prob = policy_dist.log_prob(action)
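
As far as I understand, for a single 1-D probability vector the two ways should agree; a quick toy check (made-up values, not the actual network output):

import torch
from torch.distributions import Categorical

p = torch.tensor([0.4, 0.6])          # made-up stand-in for act_prob
dist = Categorical(probs=p)
a = dist.sample()
print(dist.log_prob(a))               # log-probability of the sampled action ...
print(torch.log(p)[a])                # ... equals the log of the selected softmax entry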

But I don't get the same result.
I printed the outputs of torch.log(probs) and of log_prob, and they are quite different:

tensor([-2.6778, -2.6924, -2.7011, -2.6968, -2.7408, -2.6880, -2.6965, -2.7147, -2.6316, -2.6783, -2.6910, -2.7152, -2.6167, -2.6579, -2.6910], grad_fn=)
tensor([-0.6593, -0.6677, -0.6946, -0.6593, -0.6892, -0.6697, -0.6991, -0.7334, -0.6097, -0.6514, -0.7062, -0.7417, -0.6036, -0.6406, -0.7169], grad_fn=)

But they should be more or less the same.
Any idea what’s wrong here?

Many thanks!!!

I have to admit that I understand more now. But there are still some subtle points that are not completely clear to me.

By the way, here is the full code with the changes:

import numpy as np
import torch
import gymnasium as gyms

from torch.distributions import Categorical

torch.autograd.set_detect_anomaly(True)

env = gyms.make("CartPole-v1")

l1 = 4
l2 = 150
l3 = 2

model = torch.nn.Sequential(
    torch.nn.Linear(l1, l2),
    torch.nn.LeakyReLU(),
    torch.nn.Linear(l2, l3),
    torch.nn.Softmax(dim=1)
)

learning_rate = 0.009
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)

def discount_rewards(rewards, gamma=0.99):
    lenr = len(rewards)
    disc_return = torch.pow(gamma, torch.arange(lenr).float()) * rewards
    disc_return /= disc_return.max()
    return disc_return

def loss_fn(preds, r):
    return -1 * torch.sum(r * torch.log(preds))

MAX_DUR = 499
MAX_EPISODES = 500
gamma = 0.99
scores = []
expectation = 0.0

for episode in range(MAX_EPISODES):
    curr_state, info = env.reset()
    termination = False
    truncation = False
    transitions = []
    score = 0

    for t in range(MAX_DUR):
        act_prob = model(torch.from_numpy(curr_state).unsqueeze(dim=0).float())
        policy_dist = Categorical(probs=act_prob)
        action = policy_dist.sample()
        log_prob = policy_dist.log_prob(action)
        action = action.cpu().numpy()[0]

        prev_state = curr_state
        curr_state, reward, termination, truncation, info = env.step(action)
        score += reward
        transitions.append((prev_state, action, log_prob, t+1))

        if termination or truncation:
            break

    ep_len = len(transitions)

    optimizer.zero_grad()

    reward_batch = torch.Tensor([r for (s,a,lp,r) in transitions]).flip(dims=(0,))

    disc_returns = discount_rewards(reward_batch)

    state_batch = torch.stack([torch.from_numpy(s) for (s,a,lp,r) in transitions])
    action_batch = torch.tensor([a for (s,a,lp,r) in transitions])
    log_prob = torch.stack([lp for (s,a,lp,r) in transitions])

    pred_batch = model(state_batch)

    prob_batch = pred_batch.gather(dim=1, index=action_batch.long().view(-1,1)).squeeze()

    # Here the comparison
    loss = loss_fn(prob_batch, disc_returns)

    new_loss = -1 * torch.sum(disc_returns * log_prob.view(-1))

    new_loss.backward()
    optimizer.step()

    scores.append(score)
    avg_score = np.mean(scores[-100:])

    print(f"Episode: {episode + 1}, avg_score: {avg_score:.4f}")

If I print the two variables torch.log(prob_batch) and log_prob, I get exactly the same output:

tensor([-0.7508, -0.6382, -0.7480, -0.6418, -0.7458, -0.7440, -0.7561, -0.7740, -0.7822, -0.6004, -0.7743, -0.7887, -0.8136, -0.5656, -0.8112], grad_fn=<LogBackward0>)
tensor([-0.7508, -0.6382, -0.7480, -0.6418, -0.7458, -0.7440, -0.7561, -0.7740, -0.7822, -0.6004, -0.7743, -0.7887, -0.8136, -0.5656, -0.8112], grad_fn=<ViewBackward0>)
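
For completeness, comparing them programmatically right where the two tensors are built (using prob_batch and log_prob from the script above) should confirm the match:

print(torch.allclose(torch.log(prob_batch), log_prob.view(-1)))   # expected: True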

If I calculate the loss using torch.log(prob_batch), the algorithm learns within the 500 episodes.
But as soon as I use log_prob instead, the same algorithm under the same conditions does not learn anything.
I suspect it has something to do with the grad_fn attribute: in one case it is grad_fn=<LogBackward0>, in the other it is grad_fn=<ViewBackward0>. That is the only difference I can see.
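
A small toy example (separate from the CartPole code, with a made-up parameter w) of how I would check whether the final grad_fn name by itself changes the gradients:

import torch

w = torch.tensor([0.3, 0.7], requires_grad=True)   # made-up stand-in for a model parameter

t_log = torch.log(w)              # grad_fn=<LogBackward0>
t_view = torch.log(w).view(-1)    # grad_fn=<ViewBackward0>

g_log, = torch.autograd.grad(t_log.sum(), w)
g_view, = torch.autograd.grad(t_view.sum(), w)

print(torch.allclose(g_log, g_view))   # True: the trailing view op does not change the gradient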
What else could lead to the different behavior?