What does autograd of actor_critic.py do?

Hi,

I want to know why [torch.ones(1)] should be passed as the first gradient to autograd.backward in this example: https://github.com/pytorch/examples/blob/master/reinforcement_learning/actor_critic.py#L77-L79

final_nodes = [value_loss] + list(map(lambda p: p.action, saved_actions))
gradients = [torch.ones(1)] + [None] * len(saved_actions)
autograd.backward(final_nodes, gradients)
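
For reference, here is a minimal standalone sketch (my own toy loss and variable names, written against the plain tensor API rather than Variable) of what that torch.ones(1) seed does: for a one-element loss it just supplies dL/dL = 1, the same implicit seed that .backward() uses.

import torch
from torch import autograd

x = torch.tensor([2.0, 3.0], requires_grad=True)

loss = (x ** 2).sum(dim=0, keepdim=True)    # shape (1,) loss, like the losses in the old examples
autograd.backward([loss], [torch.ones(1)])  # explicit dL/dL = 1 seed
g_explicit = x.grad.clone()

x.grad.zero_()
loss = (x ** 2).sum(dim=0, keepdim=True)
loss.backward()                             # a one-element loss gets the same implicit seed of 1
print(torch.allclose(g_explicit, x.grad))   # True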

I expect that the Variable.reinforce call in the example is syntactic sugar for Variable.backward, like this:

import numpy as np
import torch
import torch.nn.functional as F
from collections import namedtuple
from torch.autograd import Variable

# drop-in replacements for select_action / finish_episode in actor_critic.py
# (model, args, optimizer are defined in the rest of that script)
SavedAction = namedtuple('SavedAction', ['action', 'value', 'logp'])


def select_action(state):
    state = torch.from_numpy(state).float().unsqueeze(0)
    probs, state_value = model(Variable(state))
    action = probs.multinomial().detach()  # detach the sampled action; gradients flow through logp instead
    logp = probs.index_select(dim=1, index=action.squeeze(1)).log()  # log-probability of the sampled action
    model.saved_actions.append(SavedAction(action, state_value, logp))
    return action.data


def finish_episode():
    R = 0
    saved_actions = model.saved_actions
    rewards = []
    # accumulate discounted returns, newest reward first
    for r in model.rewards[::-1]:
        R = r + args.gamma * R
        rewards.insert(0, R)
    rewards = torch.Tensor(rewards)
    # normalize the returns for variance reduction
    rewards = (rewards - rewards.mean()) / (rewards.std() + np.finfo(np.float32).eps)
    loss = 0.0
    for (action, value, logp), r in zip(saved_actions, rewards):
        reward = r - value.data[0, 0]  # advantage: return minus the value baseline
        loss += logp * -reward  # policy loss; NOTE: replaces action.reinforce(reward)
        loss += F.smooth_l1_loss(value, Variable(torch.Tensor([r])))  # value regression loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    del model.rewards[:]
    del model.saved_actions[:]
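
As a small side check on the logp line above, a toy sketch with made-up probabilities: the manual index_select(...).log() computes the same log-probability that torch.distributions.Categorical.log_prob would give.

import torch
from torch.distributions import Categorical

probs = torch.tensor([[0.1, 0.7, 0.2]])  # made-up softmax output of the policy head
action = torch.tensor([1])               # sampled action index

logp_manual = probs.index_select(dim=1, index=action).log()  # as in select_action above
logp_dist = Categorical(probs).log_prob(action)

print(torch.allclose(logp_manual.squeeze(), logp_dist.squeeze()))  # True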

I confirmed an approximate match between their results.

I think you have the answer to your question here: What is action.reinforce(r) doing actually?

I prefer your code to the one in the example; it is easier to understand what each line does when the loss is written out explicitly, rather than using action.reinforce(reward), which seems quite unintuitive to me…


@alexis-jacq thank you for pointing out that example.
Let me compare actor_critic.py and reinforce.py:

# actor_critic.py
final_nodes = [value_loss] + list(map(lambda p: p.action, saved_actions))
gradients = [torch.ones(1)] + [None] * len(saved_actions)
autograd.backward(final_nodes, gradients)

# reinforce.py
autograd.backward(policy.saved_actions, [None for _ in policy.saved_actions])

Here is why I think value_loss requires 1 while the saved_actions take [None, ...]:

  • these values are the initial backprop derivatives dL/dy in the chain rule dL/dx = dL/dy * dy/dx (see the sketch after this list)
  • value_loss's initial gradient is 1, as for a usual regression loss (y = L)
    • reinforce.py does not have such a loss; it simply normalizes with the mean/std instead of using a trainable baseline (i.e., the value net) as actor_critic.py does
  • saved_actions receive no backprop derivatives because they are detached, as seen in my snippet
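
To illustrate the first bullet, a toy sketch (my own numbers) of how the gradients argument acts as the dL/dy seed that gets multiplied through dy/dx:

import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)
y = 3 * x                          # dy/dx = 3 elementwise

dL_dy = torch.tensor([0.5, 2.0])   # plays the role of the "gradients" argument
torch.autograd.backward([y], [dL_dy])

print(x.grad)                      # tensor([1.5000, 6.0000]) == dL_dy * 3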

Is this right?