# Confusion about computing policy gradient with automatic differentiation (material from Berkeley CS285)

I am taking Berkeley’s CS285 via self-study. In the lecture on policy gradients, I am confused by an apparent inconsistency between the concept explanation and the accompanying code snippet. I am new to RL and hope someone can clarify this for me.

## Context

1. The lecture defines the policy gradient as follows:

log(pi_theta(a | s)) denotes the log probability of an action given a state, under the policy parameterized by theta.

gradient log(pi_theta(a | s)) denotes the gradient of this predicted log probability with respect to the parameters theta.
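Written out in my own notation (following the standard REINFORCE form of the estimator), this is:

$$
\nabla_\theta J(\theta) \approx \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \nabla_\theta \log \pi_\theta(a_{i,t}\mid s_{i,t})\,\hat{Q}_{i,t}
$$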

2. The lecture then defines a pseudo-loss. By auto-differentiating the pseudo-loss, we recover the policy gradient.

Here Q_hat is shorthand for the sum of rewards r(s_i_t, a_i_t) in the policy gradient equation under 1).
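Concretely (again my own notation), the pseudo-loss is the policy gradient expression from 1) with the gradient symbol dropped, so that differentiating it with respect to theta reproduces the policy gradient:

$$
\tilde{J}(\theta) = \frac{1}{N}\sum_{i=1}^{N}\sum_{t=1}^{T} \log \pi_\theta(a_{i,t}\mid s_{i,t})\,\hat{Q}_{i,t},
\qquad \hat{Q}_{i,t} = \sum_{t'} r(s_{i,t'}, a_{i,t'})
$$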

3. The lecture then proceeds to give a pseudo-code implementation of 2).
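For context, the slide's pseudo-code (reproduced here from memory, so details may differ slightly from the lecture) is along these lines:

```
# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states)
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
weighted_negative_likelihoods = tf.multiply(negative_likelihoods, q_values)
loss = tf.reduce_mean(weighted_negative_likelihoods)
gradients = loss.gradients(loss, variables)
```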

### My confusion

From 1) above,

gradient log(pi_theta(a | s)) denotes the gradient of the predicted log probability of the action with respect to the parameters theta; it is not a loss value computed from a label action and a predicted action.

Why does the code in 2) imply that gradient log(pi_theta(a | s)) morphs into the output of a loss function, rather than the predicted action probability as defined in 1)?

In this pseudo-code implementation, I am confused by this line in particular:

```
negative_likelihoods = tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)
```

Where do the actions even come from? If they come from the collected trajectories, aren’t those actions the result of `logits = policy.predictions(states)` to begin with?
Then won’t `tf.nn.softmax_cross_entropy_with_logits(labels=actions, logits=logits)` always return 0?
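To make my confusion concrete, here is a minimal NumPy sketch (my own, not from the lecture) of what I understand `softmax_cross_entropy_with_logits` to compute when the label is the one-hot encoding of a sampled action:

```python
import numpy as np

# Sketch of the cross-entropy between a one-hot label and softmax(logits):
# -sum_a labels[a] * log(softmax(logits)[a])
def softmax_cross_entropy(labels, logits):
    shifted = logits - logits.max()  # shift for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -(labels * log_probs).sum()

# Hypothetical policy logits over 3 discrete actions; suppose the
# sampled action (used as the one-hot "label") was index 0.
logits = np.array([2.0, 1.0, 0.0])
one_hot_action = np.array([1.0, 0.0, 0.0])
nll = softmax_cross_entropy(one_hot_action, logits)
# nll equals -log(softmax(logits)[0]), which is zero only if the
# policy puts probability 1 on the sampled action
```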

Based on the definition of the policy gradient in 1), shouldn’t the implementation of the pseudo-loss look like the following?

```
# Given:
# actions - (N*T) x Da tensor of actions
# states - (N*T) x Ds tensor of states
# q_values - (N*T) x 1 tensor of estimated state-action values
# Build the graph:
logits = policy.predictions(states)  # returns an (N*T) x Da tensor of action logits
weighted_predicted_probability = tf.multiply(tf.nn.softmax(logits), q_values)
loss = tf.reduce_mean(weighted_predicted_probability)
```