What modifications can maximize the efficacy of the REINFORCE algorithm for a policy gradient task?

I am straying out of my domain knowledge to attempt a basic reinforcement learning task in a toy environment and have become fairly familiar with the REINFORCE algorithm for policy gradient agents, especially PyTorch’s implementation (found here). It is clear to me now that there are superior methods to train RL agents (PPO for instance), but as I read, these feel beyond my current intellectual or time resources. As such, I’d like to eke out as much power through modifications of REINFORCE as possible before determining how I might move on.

As such, are there modifications to the REINFORCE training algorithm that might yield benefits without straying far into new algorithm territory? Or perhaps, what is the “SOTA” version of REINFORCE?

For instance, perhaps a simple gradient clip in some way approximates some of PPO’s benefits? Or maybe setting a baseline reward based on a rolling reward set of previous episodes?

If a specific context is useful, I’m applying this to a self-made simple grid environment where agents receive points mainly for moving closer to and acquiring “targets.” There are other rewards of lesser significance, but the key point is that the environment is set up for a sort of “continuous play”: agents very frequently receive rewards (due to closeness) and occasionally receive reward spikes (due to getting a target), but there is no true episode definition other than an arbitrary timestep length. I am not using batches (perhaps that would be useful?), as I have found that gradually stepping up the episode length gives the agents quicker access to simpler rewards and appears to act as a sort of scaffolding for more complex behavior. Agents might reasonably gather the “large” rewards in as few as 10-25 steps. Generally things are working fine, and I am most interested in how to extract as much value as possible from the agent updating mechanism (REINFORCE).

Yes, I would recommend trying out REINFORCE with a baseline.
Many people like PPO because it is quite efficient and there are nice out-of-the-box implementations out there.
For your problem specifically, I guess using batches would not hurt.
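
To make the baseline idea concrete, here is a minimal sketch of the "rolling reward" baseline mentioned in the question. It is plain Python so it runs without PyTorch, and every name in it (`discounted_returns`, `RollingBaseline`, `advantages`, the `window` and `gamma` defaults) is made up for illustration rather than taken from the PyTorch example. It assumes one list of per-step rewards per episode.

```python
from collections import deque


def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * G_{t+1} for every step of one episode."""
    returns = []
    running = 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


class RollingBaseline:
    """Average of the mean return over the last `window` episodes (hypothetical helper)."""

    def __init__(self, window=20):
        self.history = deque(maxlen=window)

    def value(self):
        # With no history yet, fall back to a baseline of zero (plain REINFORCE).
        return sum(self.history) / len(self.history) if self.history else 0.0

    def update(self, returns):
        self.history.append(sum(returns) / len(returns))


def advantages(rewards, baseline, gamma=0.99):
    """Return G_t - b, where b is computed from earlier episodes only."""
    returns = discounted_returns(rewards, gamma)
    b = baseline.value()      # read the baseline before adding this episode
    baseline.update(returns)  # then record this episode for future updates
    return [g - b for g in returns]
```

In the REINFORCE loss, each advantage would take the place of the raw return that multiplies `-log_prob` for its step. Gradient clipping can be layered on top by calling `torch.nn.utils.clip_grad_norm_` on the policy parameters between `loss.backward()` and `optimizer.step()`. That caps the size of each update, but it is not the same thing as PPO's clipped objective.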

Overall, it’s not super easy to answer because I have a limited understanding of your problem, but feel free to elaborate a bit more on the technical details if you’d like.

If that helps, we could provide a minimal example of PPO in TorchRL for you to try.

Thanks! My problem is currently a grid-based toy problem where agents attempt to acquire targets in a team setting, but it’s really a moving problem as I’m trying to learn through doing and see if RL is an area I’d like to invest more of my knowledge and time-base in.

Since posting, I’ve also implemented PyTorch’s critic code as a baseline. The result was lower peak scores but slightly more stable training. Batching has surprisingly had very mixed effects, although I’m not sure I’ve implemented that correctly.
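
For comparison, one common way to batch REINFORCE is to collect several episodes, compute the discounted returns for each, standardize those returns across the whole batch, and then take a single optimizer step on the summed or averaged loss. The sketch below shows only the standardization step in plain Python. The function name `batch_advantages` and the `eps` default are illustrative assumptions, and the input is assumed to be a list of per-episode return lists that have already been computed.

```python
from statistics import mean, pstdev


def batch_advantages(batch_returns, eps=1e-8):
    """Standardize returns across every step of every episode in the batch.

    batch_returns: list of episodes, each a list of per-step returns.
    The output has the same nested shape as the input.
    """
    flat = [g for episode in batch_returns for g in episode]
    mu = mean(flat)
    sigma = pstdev(flat)  # population standard deviation over the whole batch
    return [[(g - mu) / (sigma + eps) for g in episode]
            for episode in batch_returns]
```

If the batched version instead calls `optimizer.step()` once per episode, or standardizes each episode separately, it behaves much like the unbatched version, which might be one source of mixed results.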

I think a minimal example of PPO in PyTorch’s example section would be fantastic given that it seems to be high in the sort of cultural zeitgeist of RL at the moment, but beggars (me) can’t be choosers of course! The simple REINFORCE example was a great entry point for me.

Good tip, I’ll add that (a PPO tutorial on PyTorch) to my to-do list.

Sweet business. I’ll be happy to check back in after a time and give it a read then!