Hi all,

I’m working on a 2-state RL learning problem where the states are independent of any actions taken (and thus all rollouts can be done in batch).

loss = -(log(A_p) *R).sum()

I am able to come up with a simple supervised solution for the same problem, using a threshold, so I feel like it should be possible to get there with a stochastic policy gradient. Does anyone have ideas as to why the above might be the case, and potential tricks that might help convergence?

Thanks