I’m trying to implement the A3C reinforcement learning model with PyTorch.
I am planning to build a model in which the number of valid (legal) actions depends on the state.
I have studied and searched a lot and concluded that there is no single agreed answer, so I am applying a mask to the output layer in the way I think is most logically correct.
I decided to use a softmax function as the output layer and implemented it.
But there are some problems in the details. First, my mask is not fixed; it changes dynamically with the state. As a result, I cannot tell whether backpropagation works correctly.
I think gradients should be backpropagated only through the units of the softmax layer that correspond to valid actions. Is this right? Also, how does PyTorch handle this?
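To make the question concrete, here is a minimal sketch of the kind of masking I mean. It is not my actual A3C code: the names `masked_policy`, `logits`, and `valid_mask` are hypothetical, and it assumes a boolean mask that is True for legal actions, with at least one legal action per state. The logits of invalid actions are set to -inf before the softmax, so as far as I understand the masked units should receive probability zero and zero gradient.

```python
import torch
import torch.nn.functional as F


def masked_policy(logits, valid_mask):
    """Softmax over actions with invalid actions forced to probability zero.

    logits:     float tensor of shape (batch, n_actions)
    valid_mask: bool tensor of the same shape, True where the action is legal
                (assumed to contain at least one True per row)
    """
    # Replace the logits of invalid actions with -inf so softmax maps them to 0.
    masked_logits = logits.masked_fill(~valid_mask, float("-inf"))
    return F.softmax(masked_logits, dim=-1)


# Toy input standing in for the output of the policy head.
logits = torch.tensor([[1.0, 2.0, 3.0, 4.0]], requires_grad=True)
# The mask can differ for every state; here actions 1 and 3 are illegal.
valid_mask = torch.tensor([[True, False, True, False]])

probs = masked_policy(logits, valid_mask)

# Pretend action 2 (a legal one) was taken and use -log pi(a|s) as the loss.
action = torch.tensor([2])
log_prob = torch.log(probs.gather(1, action.unsqueeze(1))).sum()
(-log_prob).backward()

print(probs)        # masked actions have probability 0
print(logits.grad)  # masked actions have gradient 0
```

In this sketch the mask is just another input to the forward pass, so it can change at every step. One thing I am unsure about is whether -inf stays safe once an entropy term is added, since multiplying a zero probability by a log-probability of -inf would give NaN.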
P.S. I would also appreciate any comments and tips on how to restrict the policy to valid actions in reinforcement learning.