Training to skill-match -- RewArt or something else?

I hope this is the right place to seek help. I’ve asked elsewhere, but nobody has taken the bait, so any help – even just point-me-in-the-right-direction help – would be greatly appreciated. I know I reference SB3 here rather than PyTorch, but I suspect the help I really need is conceptual, and possibly PyTorch-related. Also, I haven’t found any dedicated SB3 forums beyond the Stack Overflow tags.

My journey was quite successful until I got stuck. I’ll summarize…

I implemented Reversi/Othello and wrapped it in a gymnasium Env. I trained with SB3 PPO, then got faster results with MaskablePPO, using action masking to screen out invalid moves. I watched tensorboard plot the nice curve toward “success” – this training was against a completely random opponent (whose moves were also pre-screened by the masker). The agent trained against the random opponent did in fact take to the training, climbing asymptotically to a satisfactory reward/score, but it was not good enough to beat me; I could always win if I tried (I’m a decent player, though). Then I trained a new agent against this initial, “pretty good” agent. The new agent got quite good – it could beat me; not always, but often enough.
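For reference, here is a stripped-down sketch of that masked training setup (ReversiEnv and its legal-move helper are placeholders standing in for my actual code, not the real thing):

```python
# Minimal sketch of the masked training loop; ReversiEnv and legal_moves_mask()
# are placeholders standing in for my real environment code.
from sb3_contrib import MaskablePPO
from sb3_contrib.common.wrappers import ActionMasker

def mask_fn(env):
    # Boolean mask over the action space: True where a move is legal.
    return env.legal_moves_mask()  # hypothetical helper on ReversiEnv

env = ReversiEnv(opponent="random")   # placeholder custom gymnasium Env
env = ActionMasker(env, mask_fn)      # exposes the mask to MaskablePPO

model = MaskablePPO("MlpPolicy", env, verbose=1, tensorboard_log="./tb")
model.learn(total_timesteps=1_000_000)
model.save("reversi_vs_random")
```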

So, now what? Well, it was finally time to work on my actual goal: to train an agent not to “win” - not to out-play any opponent, which is the typical goal - but rather to “skill-match” the opponent. That is, ultimately, I want the agent to play at approximately the level of the (real person) opponent. It may be that the agent could discern this skill level and adapt to it within a single game, or it may be that several games have to be played first; I’m not sure.

The obvious first thing I tried was modifying my reward setup. Originally, as expected, the trainee agent was rewarded move by move (i.e., each ‘step’) for tipping the score in its favor. (Actually, this reward was computed only after each even number of plays, so that both the agent and the opponent had just moved – giving the fairest point of comparison – and the agent was then rewarded for holding the higher momentary score.) In addition, a weighted reward was granted at the end of the game, since winning is what ultimately matters and there can be significant reversals move-to-move. Anyway, instead of rewarding superiority, I now rewarded “equality”: higher rewards were granted, turn by turn, for the agent keeping the net score “near zero”, and a weighted reward was granted at the end of the game for a final net score near zero (the nearer to zero, the higher the reward).
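To make that concrete, the two schemes look roughly like this (a simplified sketch with made-up coefficients, not my exact values; net_score is my discs minus the opponent’s discs, evaluated after each pair of plays):

```python
# Sketch of the two reward schemes; STEP_WEIGHT / FINAL_WEIGHT are illustrative,
# not my real coefficients.

STEP_WEIGHT = 0.1      # per-pair-of-plays shaping weight (made up)
FINAL_WEIGHT = 10.0    # end-of-game weight (made up)

def superiority_reward(net_score: int, game_over: bool, won: bool) -> float:
    """Original scheme: reward a positive disc differential, plus a win bonus."""
    reward = STEP_WEIGHT * net_score
    if game_over:
        reward += FINAL_WEIGHT if won else -FINAL_WEIGHT
    return reward

def equality_reward(net_score: int, game_over: bool) -> float:
    """Skill-match scheme: reward a net score near zero (net score in the denominator)."""
    reward = STEP_WEIGHT / (1.0 + abs(net_score))
    if game_over:
        reward += FINAL_WEIGHT / (1.0 + abs(net_score))
    return reward
```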

Intuitively, this already sounds suspicious to me. I quickly decided that rewarding this way against the “random” opponent would be pointless - the agent would just be training to play randomly, which is no training at all. So I trained it against the trained agent that could beat me occasionally. However, tensorboard revealed the sad flatline (or, rather, erratic oscillations about a flatline). I then tried training the new agent against randomly-selected models saved along the journey TO that “good player” agent; i.e., I offered the new trainee a variety of opponents - some good, some not-so-good - for its battery of training games. Still no dice - flatline oscillations again.
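The opponent-pool idea was roughly the following (again a sketch; the checkpoint paths and the reset-time hook are stand-ins for my actual code):

```python
# Sketch of sampling an opponent from earlier checkpoints for each new game.
# The paths and the env-side hook are stand-ins, not my actual code.
import glob
import random

from sb3_contrib import MaskablePPO

checkpoint_paths = sorted(glob.glob("./checkpoints/ppo_reversi_*.zip"))
opponent_pool = [MaskablePPO.load(p) for p in checkpoint_paths]

def sample_opponent():
    """Pick one earlier checkpoint to act as the opponent for the next game."""
    model = random.choice(opponent_pool)

    def opponent_policy(obs, action_mask):
        action, _ = model.predict(obs, action_masks=action_mask, deterministic=True)
        return action

    return opponent_policy

# Inside the Env's reset(), something like:
#     self.opponent_policy = sample_opponent()
```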

And again, intuitively, something tells me that there’s something “wrong”, or insufficiently clear, here – can an agent actually be trained to skill-match? Surely so… but how? Is this just a case of careful reward engineering (RewArt)? Do I need to try different algorithms? Is hyperparameter optimization my next important step? Do I need the RL Zoo? Or is all of this barking up the wrong tree?

I haven’t given you all the details (how many total timesteps I’ve tried, the exact reward values, what the code looks like, etc.); this message is already pretty long. I was hoping for more general feedback to point me in the right direction. Surely something like this has been done before, and there are helpful resources, at least in the academic literature, that I’m just failing to find? Thank you in advance to anybody who can help.

To summarize, I tried:

  • rewarding “even” scores, step by step (play-by-play), basically by putting the net score in a reward denominator
  • tuning that scheme via coefficient adjustments and similar tweaks
  • training the agent against 1) random moves, 2) “accomplished” moves, and 3) moves made by randomly-selected models from a previous training session, thus mixing more “novice” and more “accomplished” moves from one game to the next.