Confusion about computing the policy gradient with automatic differentiation (material from Berkeley CS285)

I am taking Berkeley's CS285 via self-study. In the lecture on policy gradients, I am confused by what looks like an inconsistency between the concept explanation and the code snippet that demonstrates it. I am new to RL
and hope someone can clarify this for me.

Context

1. The lecture defines the policy gradient as follows:

log(pi_theta(a | s)) denotes the log probability of an action given a state, under the policy parameterized by theta.

grad_theta log(pi_theta(a | s)) denotes the gradient of this predicted log probability of the action with respect to the parameters theta.

2. The lecture then defines a pseudo-loss. By auto-differentiating the pseudo-loss, we recover the policy gradient. (I write out both formulas after this list, as best I can reconstruct them.)

Here Q_hat is shorthand for the sum of rewards r(s_i_t, a_i_t) in the policy gradient equation under 1).

3. The lecture then gives a pseudo-code implementation of 2).
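
For concreteness, here is how I would write the two formulas from 1) and 2) (my own reconstruction, so I may have details slightly wrong):

\nabla_\theta J(\theta) \approx \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}

\tilde{J}(\theta) = \frac{1}{N} \sum_{i=1}^{N} \sum_{t=1}^{T} \log \pi_\theta(a_{i,t} \mid s_{i,t}) \, \hat{Q}_{i,t}

where \hat{Q}_{i,t} is the summed reward \sum_{t'} r(s_{i,t'}, a_{i,t'}), so auto-differentiating the pseudo-loss \tilde{J}(\theta) with respect to \theta should give back the first expression.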

My confusion

From 1) above,

grad_theta log(pi_theta(a | s)) denotes the gradient of the predicted log probability of the action with respect to the parameters theta, not a loss value calculated from a label action and a predicted action.

Why does the pseudo-code implementation of 2) imply that grad_theta log(pi_theta(a | s)) just morphs into the output of a loss function, instead of coming from the predicted action probability as defined in 1)?
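
For reference, here is that pseudo-code implementation from the lecture, as best I can transcribe it (variable names and details may be slightly off, and the last line is pseudo-code rather than a real TensorFlow call):

# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, q_values)
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = loss.gradients(loss, variables)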

In this pseudo-code implementation, I am confused in particular by this line:

negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)

Where do the actions even come from? If they come from the collected trajectories, aren't the actions the result of logits = policy.predictions(states) to begin with?
Then won't tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits) always return 0?
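
To make what I mean concrete, here is a tiny made-up example of what I think is happening (one state, three discrete actions; all numbers are invented purely for illustration):

import tensorflow as tf

# One state, three discrete actions; the logits are made up for illustration.
logits = tf.constant([[2.0, 0.5, -1.0]])                       # policy output for one state
# During data collection the action was sampled from softmax(logits) and stored
# in the trajectory; here it is one-hot encoded so it can serve as the "label".
sampled_action = tf.random.categorical(logits, num_samples=1)  # e.g. [[0]]
action_one_hot = tf.one_hot(tf.squeeze(sampled_action, axis=1), depth=3)

# The line from the lecture that confuses me:
negative_likelihood = tf.nn.softmax_cross_entropy_with_logits(labels=action_one_hot, logits=logits)
print(negative_likelihood)  # my question: shouldn't this be 0, since the action came from these logits?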

Based on the definition of the policy gradient in 1), shouldn't the implementation of the pseudo-loss look more like the following?


# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states) # This should return (N*T) x Da tensor of action logits 
weighted_predicted_probability = tf.multiply(tf.nn.softmax(logits), q_values)  # (N*T) x 1 q_values broadcast over the Da action probabilities
loss = tf.reduce_mean(weighted_predicted_probability)
gradients = tf.gradients(loss, variables)  # graph-mode style; see the TF2 sketch below
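
And to actually take the gradient of that pseudo-loss, I imagine something like this minimal TF2 sketch (the shapes and the policy_network model are made up, with policy_network standing in for policy.predictions):

import tensorflow as tf

# Made-up shapes: (N*T) = 5 samples, state dim Ds = 4, Da = 3 discrete actions.
N_T, Ds, Da = 5, 4, 3
states = tf.random.normal([N_T, Ds])
q_values = tf.random.normal([N_T, 1])
policy_network = tf.keras.Sequential([tf.keras.layers.Dense(Da)])  # stand-in for policy.predictions

with tf.GradientTape() as tape:
    logits = policy_network(states)                          # (N*T) x Da action logits
    weighted = tf.multiply(tf.nn.softmax(logits), q_values)  # (N*T) x 1 broadcasts over Da
    loss = tf.reduce_mean(weighted)
gradients = tape.gradient(loss, policy_network.trainable_variables)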