Backpropagation rule for REINFORCE weight updates using a Multinomial distribution

Hello,

I am implementing the REINFORCE algorithm (as briefly explained here) and use a Categorical distribution (equivalently, a Multinomial distribution with a single draw, as noted in the explanation of the Categorical distribution) for sampling actions.

I would like to know what the backpropagation rule looks like when plugging a Multinomial distribution into the REINFORCE update procedure.

More specifically, to put it in Williams' words, I would like to know what the characteristic eligibility looks like when using a Multinomial distribution during the training of a REINFORCE algorithm. I saw in the aforementioned paper (by Williams) what the eligibility associated with a Bernoulli distribution looks like, but I don't know how to generalize it further.
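For reference, Williams defines the characteristic eligibility of a parameter $\theta$ as the gradient of the log of the sampling density,

$$
e = \frac{\partial \ln g(a; \theta)}{\partial \theta},
$$

where $g$ is the distribution the action $a$ is drawn from. My own guess (which may well be wrong, hence this question) is that for a categorical distribution parameterized by logits $\theta_j$ through a softmax, $\pi_j = e^{\theta_j} / \sum_k e^{\theta_k}$, the eligibility after sampling action $a$ becomes

$$
\frac{\partial \ln \pi_a}{\partial \theta_j} = \delta_{ja} - \pi_j.
$$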
From the PyTorch documentation on distributions, I know that all of this is implemented by default in PyTorch. However, I would still like to understand how it works theoretically, or whether there are any papers describing the backpropagation procedure I am interested in.

Thank you very much in advance!

More practically speaking, my problem is the following. Consider the minimal code snippet below:

import torch

print("Forward pass:")

# Probability vector for parameterizing Categorical distribution
probs = torch.tensor([[0.69, 0.1, 0.2, 0.01]], requires_grad=True)

# Define distribution
dist = torch.distributions.Categorical(probs=probs)

# Sampled category
category = torch.tensor([0])

# Get log prob for previously sampled category
log_prob = dist.log_prob(category)
log_prob.retain_grad()
print("log_prob:", log_prob)

# Get prob from log prob
prob = torch.exp(log_prob)
prob.retain_grad()
print("prob:", prob)

prob.backward()

print("Backward pass - Grads:")
print("probs-grad:", probs.grad)
print("log_prob-grad:", log_prob.grad)
print("prob-grad:", prob.grad)

Running the code results in the following output being generated:

Forward pass:
log_prob: tensor([-0.3711], grad_fn=<SqueezeBackward1>)
prob: tensor([0.6900], grad_fn=<ExpBackward>)
Backward pass - Grads:
probs-grad: tensor([[ 0.3100, -0.6900, -0.6900, -0.6900]])
log_prob-grad: tensor([0.6900])
prob-grad: tensor([1.])

Given the above, I am wondering how the gradient associated with the probs tensor is calculated. Trying to work out the maths, I would have expected the components of the corresponding gradient to be 0 for all categories that were not sampled. Instead, these components are uniformly -0.6900.

Does anyone know by which rule the gradient vector associated with the probability vector probs is calculated?

This happens because the probabilities in the Categorical distribution are relative probabilities: PyTorch normalizes probs internally, so the probability of category 0 being chosen is probs_0 / sum(probs_i). Since every probs_i shows up in the probability of drawing category 0 (through the sum in the denominator), each of them gets a nonzero gradient. An easier way to see this (for me) is to scale probs so that it keeps the same relative probabilities but is much larger:
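To make the quotient rule behind this concrete, here is a small check (my own sketch, not part of the original snippet) comparing autograd's result against the closed-form derivative of p_0 / sum(p):

```python
import torch

# Quotient rule for prob = p_0 / S with S = sum(p_i):
#   d(prob)/dp_0 = (S - p_0) / S**2
#   d(prob)/dp_j = -p_0 / S**2   for j != 0
# With S = 1 this gives exactly [0.31, -0.69, -0.69, -0.69].
probs = torch.tensor([0.69, 0.1, 0.2, 0.01], requires_grad=True)

dist = torch.distributions.Categorical(probs=probs)
prob = dist.log_prob(torch.tensor(0)).exp()  # p_0 / S
prob.backward()

# Closed-form gradient, computed outside the autograd graph
with torch.no_grad():
    S = probs.sum()
    manual = -probs[0] / S**2 * torch.ones_like(probs)
    manual[0] = (S - probs[0]) / S**2

print("autograd:", probs.grad)
print("manual:  ", manual)
assert torch.allclose(probs.grad, manual)
```

The sampled category's component gets the (S - p_0)/S² term from the numerator, while every other component only contributes through the denominator, which is why they all share the same value -p_0/S².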

probs = torch.tensor([[69., 10., 20., 1.]], requires_grad=True)

You should see that none of the printed values change, except for probs-grad, which is now 100x smaller.
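If it helps, that claim can be verified with a small self-contained loop (again my own sketch): the log-probability and probability are invariant to the overall scale of probs, while the gradient shrinks by exactly the scale factor, since d log(p_0 / S)/dp_j carries a 1/S factor.

```python
import torch

grads = {}
for scale in (1.0, 100.0):
    # Same relative probabilities, different overall magnitude
    probs = (scale * torch.tensor([[0.69, 0.1, 0.2, 0.01]])).requires_grad_()
    dist = torch.distributions.Categorical(probs=probs)
    prob = dist.log_prob(torch.tensor([0])).exp()
    prob.backward()
    print(f"scale={scale}: prob={prob.item():.4f}, grad={probs.grad}")
    grads[scale] = probs.grad.clone()

# The probability is unchanged, but the gradient scales inversely with S
assert torch.allclose(grads[1.0], 100.0 * grads[100.0])
```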