Action produces different results

Why does action produce different results?

import torch as T
import torch.nn.functional as F

f = T.tensor([0.8, 0.5, 0.4])                    # raw network outputs (logits)
f2 = F.softmax(f, dim=0)                         # turn them into probabilities
print(f2)
action_probs = T.distributions.Categorical(f2)   # categorical distribution over the 3 actions
print(action_probs)
action = action_probs.sample()                   # draw one action at random
print(action)

You are sampling from a probability distribution here, right? So you should expect to get different results every time you sample from it.
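
For example, drawing a few samples from the same action_probs defined above (a quick sketch) will typically return different actions from call to call:

for _ in range(5):
    print(action_probs.sample())   # e.g. tensor(0), tensor(2), tensor(0), ...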

If you want deterministic behavior, you can fix the random seed of the random generator.
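
A minimal sketch of that (with made-up probabilities): seeding the global generator before sampling makes the sequence of draws the same on every run.

import torch as T

T.manual_seed(0)                                  # fix the global RNG seed
probs = T.tensor([0.5, 0.3, 0.2])
m = T.distributions.Categorical(probs)
print([m.sample().item() for _ in range(5)])      # reproducible across runs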

I cannot understand the logic of T.distributions.Categorical. The neural network gives me three actions, and softmax turns them into probabilities. In theory, I would choose the maximum (argmax). But I can't understand why you would do:

action_probs = T.distributions.Categorical(f2)
action = action_probs.sample()

Can you explain this to me in more detail?))

I can't understand what the connection is between the output of the neural network and action = action_probs.sample(). If the neural network gives the highest probability to action 0, but sample() returns 1, what is the point?

Hi,

The point of the distributions package is to actually sample from distributions, not to get the element with the maximum probability.
If you only want the element with the maximum probability, you should use argmax. If you want to actually sample from the probability distribution defined by the weights you have, use distributions.
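
A quick way to see the difference (a small sketch, reusing the logits from the first post): argmax always returns the same index, while sample() returns each index with roughly its softmax probability.

import torch as T
import torch.nn.functional as F
from collections import Counter

probs = F.softmax(T.tensor([0.8, 0.5, 0.4]), dim=0)         # ~ [0.41, 0.31, 0.28]
m = T.distributions.Categorical(probs)

print(T.argmax(probs))                                       # always tensor(0)
counts = Counter(m.sample().item() for _ in range(10000))
print({k: v / 10000 for k, v in sorted(counts.items())})     # roughly matches probs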

This is an example from the documentation.

probs = policy_network(state)
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()

It seems to me more logical to do this:

probs = policy_network(state)
action = T.argmax(probs)
m = Categorical(probs)
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()

Can you explain to me why one should use action = m.sample()?

Maybe to better explore the environment?

You should look into the REINFORCE algorithm for that.
But from what I remember, the gist is that the reward should be the expected reward obtained by following the probs that you computed.
If you take the argmax, you are not acting according to your computed probs, so you do not get an unbiased estimate of the gradients.
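
To make that concrete, here is a small sketch with a made-up 3-armed bandit (the per-action rewards are hypothetical): the Monte-Carlo gradient of reward * log_prob over sampled actions matches the gradient of the true expected reward, which an argmax action would not give you.

import torch as T
import torch.nn.functional as F

logits = T.tensor([0.8, 0.5, 0.4], requires_grad=True)
rewards = T.tensor([1.0, 2.0, 3.0])               # hypothetical reward for each action

probs = F.softmax(logits, dim=0)
m = T.distributions.Categorical(probs)

# Exact gradient of the expected reward sum_a probs[a] * rewards[a]
expected_reward = (probs * rewards).sum()
exact_grad = T.autograd.grad(expected_reward, logits, retain_graph=True)[0]

# REINFORCE / score-function estimate from sampled actions
actions = m.sample((20000,))                      # many sampled actions
surrogate = (rewards[actions] * m.log_prob(actions)).mean()
mc_grad = T.autograd.grad(surrogate, logits)[0]

print(exact_grad)   # analytic gradient of the expected reward
print(mc_grad)      # close to exact_grad, because the actions were sampled from probs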

Well, logically, if you think about it: the neural network outputs the probabilities and I take the maximum value. But if I use action = m.sample(), then it is not clear to me from what distribution the action is drawn.

m.sample() generates a sample from the distribution given when creating m, which is probs here.
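
For instance, reusing f2 from the first snippet, you can inspect the distribution object directly; the (normalized) probabilities it samples from are exactly the tensor you passed in:

m = T.distributions.Categorical(f2)
print(m.probs)   # same probabilities as f2: this is what sample() draws from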