Backpropagation rule for REINFORCE weight updates using a Multinomial distribution

Hello,

I am implementing the REINFORCE algorithm (as briefly explained here), using a Categorical distribution (or, equivalently, a Multinomial distribution with a single trial, as argued in the explanation of the Categorical distribution) to sample actions.
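For context, the sampling/update pattern I follow is essentially the standard score-function pattern from the PyTorch distributions documentation; the sketch below is simplified, with a placeholder probability vector and a dummy reward:

import torch

# Placeholder output of the policy (e.g. a softmax layer) over four actions
probs = torch.tensor([0.69, 0.1, 0.2, 0.01], requires_grad=True)

dist = torch.distributions.Categorical(probs=probs)
action = dist.sample()   # sample an action index from the categorical policy
reward = 1.0             # dummy return from the environment

# REINFORCE surrogate loss: -log pi(action) * return
loss = -dist.log_prob(action) * reward
loss.backward()          # gradient of the surrogate loss ends up in probs.grad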

I would like to know what the backpropagation rule looks like when plugging a Multinomial distribution into the REINFORCE update procedure.

More specifically, to put it in Williams' words, I would like to know what the characteristic eligibility looks like when a Multinomial distribution is used while training with REINFORCE. I saw in the aforementioned paper (by Williams) what the eligibility associated with a Bernoulli distribution looks like, but I don't know how to generalize it further.
From the explanation of distributions in PyTorch, I know that all of this is already implemented in PyTorch. However, I would still like to understand how it works theoretically, or whether there are papers describing the backpropagation procedure I am interested in.
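For reference, the definition I am working from (if I am reading Williams' paper correctly) is the characteristic eligibility

e_{ij} = \frac{\partial \ln g_i}{\partial w_{ij}},

where g_i is the probability that unit i assigns to its sampled output y_i given its weights w_i and input x. For a Bernoulli-logistic unit with p_i = \sigma\left(\sum_j w_{ij} x_j\right), the paper works this out to (y_i - p_i)\, x_j. What I am missing is the analogous expression when g is a Categorical/Multinomial distribution.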

Thank you very much in advance!

More practically speaking, my problem is the following. Consider the minimal code snippet below:

import torch

print("Forward pass:")

# Probability vector for parameterizing Categorical distribution
probs = torch.tensor([[0.69, 0.1, 0.2, 0.01]], requires_grad=True)

# Define distribution
dist = torch.distributions.Categorical(probs=probs)

# Category assumed to have been sampled previously (fixed to 0 here for reproducibility)
category = torch.tensor([0])

# Get log prob for previously sampled category
log_prob = dist.log_prob(category)
log_prob.retain_grad()
print("log_prob:", log_prob)

# Get prob from log prob
prob = torch.exp(log_prob)
prob.retain_grad()
print("prob:", prob)

prob.backward()

print("Backward pass - Grads:")
print("probs-grad:", probs.grad)
print("log_prob-grad:", log_prob.grad)
print("prob-grad:", prob.grad)

Running the code produces the following output:

Forward pass:
log_prob: tensor([-0.3711], grad_fn=<SqueezeBackward1>)
prob: tensor([0.6900], grad_fn=<ExpBackward>)
Backward pass - Grads:
probs-grad: tensor([[ 0.3100, -0.6900, -0.6900, -0.6900]])
log_prob-grad: tensor([0.6900])
prob-grad: tensor([1.])

Given the above, I am wondering how the gradient associated with the probs tensor is calculated. Trying to work out the maths myself, I would have expected the components of the corresponding gradient to be 0 for all categories that were not sampled. Instead, those components are all -0.6900.
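For what it is worth, my current guess is that the internal normalization of probs (Categorical divides by the sum of the entries) is what spreads the gradient onto the non-sampled categories, but I am not sure this is the right way to think about it. The small check below computes the gradient by hand under that assumption (p_i = probs_i / probs.sum(), so prob = probs[0, 0] / probs.sum()) and compares it with autograd:

import torch

probs = torch.tensor([[0.69, 0.1, 0.2, 0.01]], requires_grad=True)
dist = torch.distributions.Categorical(probs=probs)
category = torch.tensor([0])

prob = torch.exp(dist.log_prob(category))
prob.backward()

# Assumption: prob = probs[0, 0] / probs.sum(), hence
# d prob / d probs_i = (1[i == 0] - probs_0 / S) / S with S = probs.sum()
S = probs.detach().sum()
p = probs.detach() / S
one_hot = torch.zeros_like(p)
one_hot[0, 0] = 1.0
manual_grad = (one_hot - p[0, 0]) / S

print("autograd grad:", probs.grad)
print("manual grad:  ", manual_grad)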

Does anyone know by which rule the gradient vector associated with the probability vector probs is computed?