I am trying to implement a basic policy-gradient training setup. I have seen examples for games where the same set of actions is possible in every state. However, I am wondering how one would implement sampling actions from the policy when, in some states, not all actions are allowed.
The problem is that indexing the policy with the valid actions changes its shape: the rows no longer all have the same length, so I can no longer treat them as a batch.
So what is the right way of handling the limited action space so that the gradients are still handled properly?
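For illustration, a minimal reproduction of that shape problem (`probs` and `allowed` are made-up names, not the original code):

```python
import torch

# Per-state action probabilities for a batch of 2 states, 4 actions each
probs = torch.rand(2, 4).softmax(dim=1)
# Boolean mask of which actions are allowed in each state
allowed = torch.tensor([[True, True, False, True],
                        [True, False, False, True]])

legal = probs[allowed]  # boolean indexing flattens to a 1-D tensor;
                        # the per-row structure (3 vs. 2 entries) is lost,
                        # so torch.multinomial can no longer sample per row
```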
EDIT
I now manually set the probabilities of disallowed actions to zero, like this:
But I would prefer a solution that somehow considers this in the sampling function, so that the network can actually "learn" that these actions are not allowed, or less wanted, in some states.
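A hedged sketch of that masking, since the original snippet is not shown here (`probs`, `mask`, and `masked_sample` are assumed names; `mask` uses 1 for allowed and 0 for forbidden actions):

```python
import torch

def masked_sample(probs, mask):
    masked = probs * mask                              # zero out forbidden actions
    masked = masked / masked.sum(dim=1, keepdim=True)  # renormalize each row to sum to 1
    actions = torch.multinomial(masked, num_samples=1) # sample one action per row
    return actions, masked

probs = torch.tensor([[0.25, 0.25, 0.25, 0.25]])
mask = torch.tensor([[1.0, 0.0, 1.0, 0.0]])
actions, masked = masked_sample(probs, mask)  # only actions 0 or 2 can be drawn
```

If `probs` comes straight from the network, the multiplication stays in the autograd graph, so gradients only flow through the allowed entries. Beware: if every action in a row is masked out, the renormalization divides by zero and produces NaNs, which `torch.multinomial` rejects.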
There is no need to learn which actions are or are not allowed. By setting the probabilities of the forbidden actions to zero, your agent will only explore the allowed ones and learn which of the allowed actions is best.
However, if you want to reduce the log_prob of the forbidden actions in your gradient step, you can add a term to the loss that simply reduces the log-probabilities of the forbidden actions.
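A sketch of what such an extra loss term could look like. All names here (`beta`, `logits`, `mask`, `advt`, `chosen`) are illustrative, not taken from the original code:

```python
import torch

torch.manual_seed(0)
beta = 0.1  # penalty weight for forbidden actions (assumed hyperparameter)
logits = torch.randn(4, 5, requires_grad=True)   # batch of 4 states, 5 actions
mask = torch.tensor([[1., 1., 0., 1., 0.],
                     [1., 0., 1., 1., 1.],
                     [0., 1., 1., 0., 1.],
                     [1., 1., 1., 1., 0.]])      # 1 = allowed, 0 = forbidden
advt = torch.randn(4)                            # advantages
chosen = torch.tensor([0, 2, 1, 3])              # sampled (allowed) actions

log_probs = torch.log_softmax(logits, dim=1)
neg_log_probs = -log_probs[torch.arange(4), chosen]

# standard policy-gradient loss ...
policy_loss = (neg_log_probs * advt).mean()
# ... plus a term that, when minimized, pushes the log-probabilities
# of forbidden actions further down
forbidden_penalty = beta * (log_probs * (1 - mask)).sum(dim=1).mean()
loss = policy_loss + forbidden_penalty
loss.backward()
```

Minimizing `forbidden_penalty` drives the forbidden entries of `log_probs` toward negative infinity, i.e. their probabilities toward zero, so the network itself learns to avoid them.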
However, now I get a CUDA runtime error that seems to come from the multinomial operation, although the traceback says the error is triggered by the line where I compute the policy loss with (negLogProbs * advt).mean(), after which I call lossPolicy.backward():
/home/marcel/anaconda3/envs/pytorch/pytorch/aten/src/THC/THCTensorRandom.cuh:182: void sampleMultinomialOnce(long *, long, int, T *, T *, int, int) [with T = float, AccT = float]: block: [0,0,0], thread: [2,0,0] Assertion `THCNumerics<T>::ge(val, zero)` failed.
THCudaCheck FAIL file=/home/marcel/anaconda3/envs/pytorch/pytorch/aten/src/THC/generated/../THCReduceAll.cuh line=339 error=59 : device-side assert triggered
Traceback (most recent call last):
....
File "/home/marcel/Projects/test/train.py", line 40, in train
lossPolicy = (negLogProbs * advt).mean()
RuntimeError: cuda runtime error (59) : device-side assert triggered at /home/marcel/anaconda3/envs/pytorch/pytorch/aten/src/THC/generated/../THCReduceAll.cuh:339
Is there anything wrong with the cloning of the legal actions? Does this backpropagate correctly?
The CUDA runtime error is often misleading and can be caused by a number of different underlying errors.
What you can do is run your network on the CPU for debugging. This usually gives better error messages, and you can find the line that actually leads to the error.
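A sketch of both debugging options. The assertion message in your traceback (`ge(val, zero)` failing inside `sampleMultinomialOnce`) suggests `torch.multinomial` received a negative or NaN probability, which the CPU implementation reports with an explicit message:

```python
import os
import torch

# Option 1: make CUDA kernel launches synchronous so the traceback points
# at the line that actually failed. Must be set before CUDA is initialized.
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

# Option 2: reproduce the failing step on the CPU, where validation is explicit.
bad_probs = torch.tensor([[0.5, -0.1, 0.6]])  # a negative entry, as the assert suggests
try:
    torch.multinomial(bad_probs, num_samples=1)
except RuntimeError as e:
    print("CPU error:", e)  # explicit message about the invalid distribution
```

In this setup, a likely source of negative or NaN entries is the masked-and-renormalized probability tensor, e.g. a row whose allowed probabilities all became zero.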