How to implement action sampling for differing allowed actions

marcel1991 · March 6, 2018, 6:17pm

I am trying to implement a basic policy gradient training setup. The examples I have seen are for games where the same actions are possible in every state. However, I am wondering how to implement action sampling from the policy when not all actions are allowed in some states.

Right now I do the action sampling like this:

action = torch.multinomial(policy, 1)
log_probs = torch.log(torch.gather(policy, -1, action))

For the limited action set I thought of something like this:

limited_policy = policy[valids]
action = torch.multinomial(limited_policy, 1)
log_probs = torch.log(torch.gather(limited_policy, -1, action))

The problem here is that indexing with the valid mask changes the shape of the limited policy: the rows no longer all have the same length, so I cannot use it as a batch anymore.

So what is the right way to handle the limited action space so that the gradients are also handled properly?

EDIT

I now manually set the probabilities of the disallowed actions to zero, like this:

policy[1 - valids] = 0
action = torch.multinomial(policy, 1)
log_probs = torch.log(torch.gather(policy, -1, action))

But I would prefer a solution that somehow takes this into account in the sampling function, so that the network can actually “learn” that these actions are not allowed or less wanted in some states.

alexis-jacq · March 6, 2018, 8:03pm

There is no need to learn which actions are allowed. By setting the probabilities of the forbidden actions to zero, your agent will only explore the allowed ones and learn which action is best within the allowed set.
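A minimal sketch of that masking for a whole batch, assuming `policy` holds softmax probabilities of shape (batch, n_actions) and `valids` is a boolean mask of the same shape with at least one allowed action per row. The helper name `sample_valid_actions` and the toy tensors are made up for illustration. The mask is applied out of place and the rows are renormalized, so every row keeps the same length and the log-probabilities stay differentiable:

```python
import torch

def sample_valid_actions(policy, valids):
    """Sample one allowed action per row and return its log-probability."""
    # Zero out forbidden actions without modifying `policy` in place
    masked = policy * valids.to(policy.dtype)
    # Renormalize so each row is a distribution over the allowed actions only
    masked = masked / masked.sum(dim=-1, keepdim=True)
    actions = torch.multinomial(masked, 1)
    log_probs = torch.log(torch.gather(masked, -1, actions))
    return actions, log_probs

# Toy batch: 4 states, 5 actions (hypothetical values)
torch.manual_seed(0)
logits = torch.randn(4, 5, requires_grad=True)
policy = torch.softmax(logits, dim=-1)
valids = torch.tensor([[1, 1, 0, 0, 1],
                       [0, 1, 1, 1, 1],
                       [1, 0, 0, 0, 0],
                       [1, 1, 1, 1, 1]], dtype=torch.bool)
actions, log_probs = sample_valid_actions(policy, valids)
```

A row with no allowed action at all would divide by zero here, so the caller has to rule that case out.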

However, if you want to reduce the log_prob of the forbidden actions in your gradient step, you can add a gradient direction that simply reduces the log-probabilities of the forbidden actions. In other words, instead of:

theta = theta - alpha * grad_logprobs * returns # (REINFORCE iteration)

you do:

theta = theta - alpha * (grad_logprobs * returns + grad_logprobs * forbidden) 
# where forbidden = 0 if a state-action is allowed

or again:

theta = theta - alpha * grad_logprobs * (returns+forbidden)

But I think it would take longer to train, as you would have to explore a bigger space.

marcel1991 · March 8, 2018, 11:52am

Hm, okay, thank you. Why is a forbidden action treated as a “higher reward”? Shouldn’t it be negated, like

grad_logprobs *  (returns - forbidden)

Wouldn’t adding it encourage the agent to take the forbidden action more often?
Or am I misunderstanding something completely here?

alexis-jacq · March 9, 2018, 9:18am

Yes, the forbidden term should be negative, and the sign in front of the learning rate positive (gradient ascent). I was too quick on this. The correct iteration is:

theta = theta + alpha * grad_logprobs * (returns - forbidden)

That way, you optimize the return, but still minimize the log_prob of forbidden actions.

By the way, you can do this and still force your agent to explore only the allowed actions, so you don’t lose time on exploration.
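In autograd terms, the corrected iteration can be sketched as a loss whose gradient-descent step gives that update. The function name `policy_loss` and the toy parameters are hypothetical. `forbidden` is assumed to be 1 where the sampled state-action is forbidden and 0 otherwise, and the `.mean()` only rescales the step by the batch size:

```python
import torch

def policy_loss(log_probs, returns, forbidden):
    """Minimizing this loss performs
    theta = theta + alpha * grad_logprobs * (returns - forbidden).
    """
    # The weights are constants with respect to the policy parameters
    weights = (returns - forbidden).detach()
    return -(log_probs * weights).mean()

# Toy check: one sample, zero return, the sampled action (index 2) is forbidden
theta = torch.zeros(3, requires_grad=True)  # hypothetical logits over 3 actions
log_policy = torch.log_softmax(theta, dim=-1)
actions = torch.tensor([2])
log_probs = log_policy[actions]
returns = torch.tensor([0.0])
forbidden = torch.tensor([1.0])

loss = policy_loss(log_probs, returns, forbidden)
loss.backward()
# theta.grad[2] is positive, so a descent step lowers the logit of the forbidden action
```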

marcel1991 · March 9, 2018, 9:36am

Okay, thank you very much for the good explanation.

marcel1991 · March 10, 2018, 12:11pm

Okay, I now have a problem with the implementation of this.

I clone the action probabilities to avoid an in-place operation, like this:

pActions, values = self.forward(observations)
legalActions = pActions.clone()
legalActions[1 - valids] = 0
actions = torch.multinomial(legalActions, 1).long()
logProbs = torch.log(torch.gather(pActions, -1, actions))

However, I now get a CUDA runtime error that seems to come from the multinomial operation, although the traceback says the error is triggered by the line where I calculate the policy loss with (negLogProbs * advt).mean(), after which I call lossPolicy.backward():

/home/marcel/anaconda3/envs/pytorch/pytorch/aten/src/THC/THCTensorRandom.cuh:182: void sampleMultinomialOnce(long *, long, int, T *, T *, int, int) [with T = float, AccT = float]: block: [0,0,0], thread: [2,0,0] Assertion `THCNumerics<T>::ge(val, zero)` failed.
THCudaCheck FAIL file=/home/marcel/anaconda3/envs/pytorch/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=339 error=59 : device-side assert triggered
Traceback (most recent call last):
 ....

  File "/home/marcel/Projects/test/train.py", line 40, in train
    lossPolicy = (negLogProbs * advt).mean()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/marcel/anaconda3/envs/pytorch/pytorch/aten/src/THC/generated/../THCReduceAll.cuh:339

Is there anything wrong with the cloning of the legal actions? Does this backpropagate correctly?

Ricardo_Gama · May 27, 2018, 8:34am

Hello. I’m facing a similar problem.
Have you found a solution?
Thanks.

marcel1991 · May 28, 2018, 7:59am

The CUDA runtime error is often misleading and can be caused by a number of different errors.

What you can do is run your network on the CPU for debugging. This usually gives better error messages, and you can find the line that actually led to the error.
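CUDA kernels are launched asynchronously, which is why the traceback can point at a later line than the one that failed; setting the environment variable `CUDA_LAUNCH_BLOCKING=1` is another way to get the failing line reported. The assertion in the log above checks that every sampled weight is >= 0, which suggests that the tensor passed to `torch.multinomial` contained a negative value or a NaN. A small check like the following, called right before sampling, gives a readable error on either device. This is only a sketch, and `check_probs` is a made-up helper name:

```python
import torch

def check_probs(probs):
    """Raise a readable error if `probs` is not safe to pass to torch.multinomial."""
    if torch.isnan(probs).any() or torch.isinf(probs).any():
        raise ValueError("probabilities contain NaN or inf")
    if (probs < 0).any():
        raise ValueError("probabilities contain negative entries")
    if (probs.sum(dim=-1) <= 0).any():
        raise ValueError("at least one row has no positive probability")
    return probs

# Passes: every row is non-negative, finite, and has positive mass
good = torch.tensor([[0.2, 0.8], [1.0, 0.0]])
actions = torch.multinomial(check_probs(good), 1)
```

In the snippet from the earlier post this would be `torch.multinomial(check_probs(legalActions), 1)`.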