I am straying outside my domain knowledge to attempt a basic reinforcement learning task in a toy environment, and I have become fairly familiar with the REINFORCE algorithm for policy-gradient agents, especially PyTorch’s implementation (found here). It is clear to me now that there are superior methods for training RL agents (PPO, for instance), but from what I’ve read, those feel beyond my current time and expertise. As such, I’d like to eke out as much performance from modifications of REINFORCE as possible before deciding how to move on.
So: are there modifications to the REINFORCE training algorithm that might yield benefits without straying far into new-algorithm territory? Or, put another way, what is the “SOTA” version of REINFORCE?
For instance, perhaps a simple gradient clip approximates some of PPO’s benefit of limiting update size? Or maybe subtracting a baseline computed from a rolling average of returns over previous episodes? Roughly the kind of thing I have in mind is sketched below.
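To make that concrete, here is a rough sketch of the sort of modified update I’m imagining, written against a generic REINFORCE step (this is not my actual code; `policy`, `optimizer`, and the per-episode `log_probs`/`rewards` are placeholders):

```python
import torch
from collections import deque

# Rolling window of recent episode returns, used as a simple baseline.
baseline_buffer = deque(maxlen=100)

def reinforce_update(policy, optimizer, log_probs, rewards, gamma=0.99, clip_norm=1.0):
    # Discounted returns, computed backwards over the episode.
    returns = []
    G = 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)

    # Baseline: mean return over the last few episodes (rolling average).
    baseline = sum(baseline_buffer) / len(baseline_buffer) if baseline_buffer else 0.0
    advantages = returns - baseline

    # REINFORCE loss: -log pi(a|s) * advantage, summed over the episode.
    loss = -(torch.stack(log_probs) * advantages).sum()

    optimizer.zero_grad()
    loss.backward()
    # Gradient clipping, loosely in the spirit of capping update size.
    torch.nn.utils.clip_grad_norm_(policy.parameters(), clip_norm)
    optimizer.step()

    # Track this episode's return for future baselines.
    baseline_buffer.append(returns[0].item())
```

Is this kind of tweak (baseline subtraction plus clipping) where most of the low-hanging fruit is, or are there other standard additions?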
If specific context is useful: I’m applying this to a self-made simple grid environment where agents receive points primarily for moving closer to and acquiring “targets.” There are other rewards of lesser significance, but the key point is that the environment is set up for a sort of “continuous play”: agents very frequently receive small rewards (for closeness) and occasionally receive reward spikes (for getting a target), and there is no true episode definition other than an arbitrary timestep cutoff. I am not using batches (perhaps that would be useful?), as I have found that gradually stepping up the episode length gives the agents quicker access to simpler rewards and acts as a sort of scaffolding toward more complex behavior; a schematic of that loop is below. Agents can reasonably gather the “large” rewards in as few as 10–25 steps. Generally things are working fine, and I am most interested in how to extract as much value from the agent-updating mechanism (REINFORCE) as possible.
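For reference, my outer loop looks schematically like this (`run_episode`, `env`, `num_episodes`, and the update function above are placeholders standing in for my actual code):

```python
episode_length = 10  # start short so the simple (closeness) rewards are reachable
for episode in range(num_episodes):
    # Roll out one truncated "episode" of at most episode_length steps.
    log_probs, rewards = run_episode(env, policy, max_steps=episode_length)
    reinforce_update(policy, optimizer, log_probs, rewards)

    # Periodically lengthen the truncation as a crude curriculum.
    if episode % 500 == 0 and episode_length < 200:
        episode_length += 10
```

So the question is really about what belongs inside `reinforce_update` to get the most out of this setup.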