Action produces different results

Why does action produce different results?

import torch as T
import torch.nn.functional as F

f = T.tensor([0.8, 0.5, 0.4])                    # raw network outputs (logits)
f2 = F.softmax(f, dim=0)                         # turn them into probabilities
print(f2)
action_probs = T.distributions.Categorical(f2)   # categorical distribution over the 3 actions
print(action_probs)
action = action_probs.sample()                   # draw one action at random
print(action)

You are sampling from a probability distribution here, right? So you should expect to get different results every time you sample from it.
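
For example, drawing a few samples from the same action_probs defined above (a quick sketch) will typically return different actions from call to call:

for _ in range(5):
    print(action_probs.sample())   # e.g. tensor(0), tensor(2), tensor(0), ...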

If you want deterministic behavior, you can fix the random seed of the random generator.
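
A minimal sketch of that (with made-up probabilities): seeding the global generator before sampling makes the sequence of draws the same on every run.

import torch as T

T.manual_seed(0)                                  # fix the global RNG seed
probs = T.tensor([0.5, 0.3, 0.2])
m = T.distributions.Categorical(probs)
print([m.sample().item() for _ in range(5)])      # reproducible across runs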

I cannot understand the logic of T.distributions.Categorical. The neural network gives me three actions, and softmax turns them into probabilities. In theory, I would choose the maximum (argmax). But I can't understand why you would do:

action_probs = T.distributions.Categorical(f2)
action = action_probs.sample()

Can you explain this to me in more detail?))

I can't understand what the connection is between the output of the neural network and action = action_probs.sample(). If the neural network gives the highest probability to action 0, but sample() returns 1, what is the point?

Hi,

The point of the distributions package is to actually sample from distributions, not to get the element with the maximum probability.
If you only want the element with the maximum probability, you should use argmax. If you want to actually sample from the probability distribution defined by the weights you have, use distributions.
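
A quick way to see the difference (a small sketch, reusing the logits from the first post): argmax always returns the same index, while sample() returns each index with roughly its softmax probability.

import torch as T
import torch.nn.functional as F
from collections import Counter

probs = F.softmax(T.tensor([0.8, 0.5, 0.4]), dim=0)         # ~ [0.41, 0.31, 0.28]
m = T.distributions.Categorical(probs)

print(T.argmax(probs))                                       # always tensor(0)
counts = Counter(m.sample().item() for _ in range(10000))
print({k: v / 10000 for k, v in sorted(counts.items())})     # roughly matches probs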

This is an example from the documentation.

probs = policy_network(state)
m = Categorical(probs)
action = m.sample()
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()

It seems to me more logical to do this:

probs = policy_network(state)
action = T.argmax(probs)
m = Categorical(probs)
next_state, reward = env.step(action)
loss = -m.log_prob(action) * reward
loss.backward()

Can you explain to me why one should use action = m.sample()?

Maybe to better explore the environment?

You should look into the REINFORCE algorithm for that.
But from what I remember, the gist is that the reward should be the expected reward obtained by following the probs that you computed.
If you take the argmax, you are not acting according to your computed probs, so you do not get an unbiased estimate of the gradients.
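
To make that concrete, here is a small sketch with a made-up 3-armed bandit (the per-action rewards are hypothetical): the Monte-Carlo gradient of reward * log_prob over sampled actions matches the gradient of the true expected reward, which an argmax action would not give you.

import torch as T
import torch.nn.functional as F

logits = T.tensor([0.8, 0.5, 0.4], requires_grad=True)
rewards = T.tensor([1.0, 2.0, 3.0])               # hypothetical reward for each action

probs = F.softmax(logits, dim=0)
m = T.distributions.Categorical(probs)

# Exact gradient of the expected reward sum_a probs[a] * rewards[a]
expected_reward = (probs * rewards).sum()
exact_grad = T.autograd.grad(expected_reward, logits, retain_graph=True)[0]

# REINFORCE / score-function estimate from sampled actions
actions = m.sample((20000,))                      # many sampled actions
surrogate = (rewards[actions] * m.log_prob(actions)).mean()
mc_grad = T.autograd.grad(surrogate, logits)[0]

print(exact_grad)   # analytic gradient of the expected reward
print(mc_grad)      # close to exact_grad, because the actions were sampled from probs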

Well, logically, if you think about it: the neural network outputs the probabilities and I take the maximum value. But if I use action = m.sample(), then it is not clear to me from what distribution the action is drawn.

m.sample() generates a sample from the distribution given when creating m, which is probs here.
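
For instance, reusing f2 from the first snippet, you can inspect the distribution object directly; the (normalized) probabilities it samples from are exactly the tensor you passed in:

m = T.distributions.Categorical(f2)
print(m.probs)   # same probabilities as f2: this is what sample() draws from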