# Backpropagation rule for REINFORCE weight updates using a Multinomial distribution

Hello,

I am implementing the REINFORCE algorithm (as superficially explained over here), using a Categorical distribution (or, equivalently, a Multinomial distribution with a single draw, as argued in the explanation of the Categorical distribution) for sampling actions.
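Concretely, the action-sampling and update pattern I am using follows the score-function example from the PyTorch distributions documentation (a minimal sketch; the reward value here is just a stand-in for whatever the environment would return):

```python
import torch

# Policy output: probabilities over 4 actions (toy values)
probs = torch.tensor([0.25, 0.25, 0.25, 0.25], requires_grad=True)
dist = torch.distributions.Categorical(probs=probs)

# Sample an action and score it
action = dist.sample()
reward = 1.0  # placeholder reward from a hypothetical environment

# REINFORCE / score-function surrogate loss
loss = -dist.log_prob(action) * reward
loss.backward()

print(probs.grad)
```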

I would like to know what the backpropagation rule looks like when plugging a Multinomial distribution into the REINFORCE update procedure.

More specifically, to put it in Williams’ words, I would like to know what the characteristic eligibility looks like when using a Multinomial distribution during the training of a REINFORCE algorithm. I saw in the aforementioned paper (by Williams) what the eligibility associated with a Bernoulli distribution looks like, but I don’t know how to generalize that result further.
From the explanation of distributions in PyTorch, I know that all of this is implemented by default in PyTorch. However, I would still like to know how it works theoretically, or whether there are any papers describing the backpropagation procedure I am interested in.
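For reference, here is my own attempt at the generalization (my derivation, not something taken from Williams’ paper, so it may well be wrong). Writing the categorical mass function as $g(a, p) = p_a$ for sampled category $a$, the characteristic eligibility would be:

```latex
% Williams' characteristic eligibility: e = \partial \ln g / \partial w.
% For a categorical distribution parameterized directly by probabilities
% p_1, \dots, p_k, with sampled category a:
\frac{\partial \ln p_a}{\partial p_i} = \frac{\mathbf{1}[i = a]}{p_a}

% If the probabilities come from a softmax over logits z_1, \dots, z_k,
% the chain rule gives the compact form:
\frac{\partial \ln p_a}{\partial z_i} = \mathbf{1}[i = a] - p_i
```

The softmax form reduces to Williams’ Bernoulli eligibility in the two-category case, which makes me think it is the right generalization, but I would appreciate confirmation.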

Thank you very much in advance!

More practically speaking, my problem is the following. Consider the minimal code snippet below:

```python
import torch

print("Forward pass:")

# Probability vector for parameterizing Categorical distribution
probs = torch.tensor([[0.69, 0.1, 0.2, 0.01]], requires_grad=True)

# Define distribution
dist = torch.distributions.Categorical(probs=probs)

# Sampled category
category = torch.tensor([0])

# Get log prob for previously sampled category
log_prob = dist.log_prob(category)
print("log_prob:", log_prob)

# Get prob from log prob
prob = torch.exp(log_prob)
print("prob:", prob)

prob.backward()

```

Running the code results in the following output being generated:

```
Forward pass:
log_prob: tensor([-0.3711], grad_fn=<SqueezeBackward1>)
prob: tensor([0.6900], grad_fn=<ExpBackward0>)
```

Given the above, I am wondering how the gradient associated with the `probs` tensor is calculated. Trying to work out the maths, I would have expected the components of the corresponding gradient to be `0` for all categories that have not been sampled. Instead, these gradient components are uniformly `-0.6900`.
Does anyone know which rule is used to compute the gradient vector associated with the probability vector `probs`?
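For completeness, extending the snippet above to print `probs.grad` after the backward call shows the full gradient I am referring to (the first component corresponds to the sampled category `0`):

```python
import torch

probs = torch.tensor([[0.69, 0.1, 0.2, 0.01]], requires_grad=True)
dist = torch.distributions.Categorical(probs=probs)

# Same forward pass as above: log prob of category 0, then exponentiate
log_prob = dist.log_prob(torch.tensor([0]))
prob = torch.exp(log_prob)
prob.backward()

# Gradient of prob with respect to the raw probability vector
print("probs.grad:", probs.grad)
# -> tensor([[ 0.3100, -0.6900, -0.6900, -0.6900]])
```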