Reproducibility with categorical distribution


I am using a PPO2 agent for RL.
Both my NN and also the agent itself are using categorical distribution.

For me reproducibility is important so I set all the random generator seeds to 0 plus whatever was written regarding cublas and deterministic of pytorch…

The following steps are done:

  1. The seeds are set to 0 at the beginning of the main file.
  2. Inside the main there is a for loop that keeps evaluating the trained agent on backtest data many times in a row…

I’ve just noticed that if I run the backtest loop many times on the same data with the same NN parameters, I get different results (dropout should be inactive since .eval() mode is set).

I guess I observe this due to the categorical distribution and how it is sampled. When I set torch.manual_seed(0) at the beginning of EACH backtest loop, the results are the same.
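To illustrate what I mean, here is a minimal sketch (the probabilities are made-up placeholders, not my agent's actual output): sampling from a Categorical draws from the global RNG, so two runs only match if the seed is reset before each one.

```python
import torch
from torch.distributions import Categorical

probs = torch.tensor([0.4, 0.35, 0.25])  # hypothetical action probabilities

# Resetting the seed before each run makes the sampled actions identical.
torch.manual_seed(0)
run_a = [Categorical(probs=probs).sample().item() for _ in range(5)]

torch.manual_seed(0)
run_b = [Categorical(probs=probs).sample().item() for _ in range(5)]

print(run_a == run_b)  # True: same seed, same sample sequence
```

Without the second torch.manual_seed(0) call, run_b would generally differ from run_a, which matches what I see in the backtest loop.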

So I am a bit confused regarding the RL example at PyTorch.
I mean, on one hand it’s clear how to train the network, but if, due to the distribution, the result is always a bit different metric-wise, it’s not really clear how to evaluate such an agent.

My questions:

  1. Is my understanding correct that categorical distributions could cause slightly different actions of the agent? (And that this is also a technique to make the agent explore the environment during sample collection?)
  2. If so, what is the best practice for training and evaluation? Also, when the model is deployed, how is it deployed in real life (e.g. shall torch.manual_seed(0) be called each time before the model is called)?

E.g.: Shall I set the seed to 0 at each backtest loop and restore it when training resumes?
When collecting samples: shall I save the random generator state with torch.get_rng_state at the end of the collection phase, set torch.manual_seed(0) when running the backtest loop, and restore the random generator state later, when running the next sample collection phase? Also, at deployment, shall I set torch.manual_seed(0) each time before calling the agent? Or simply use the argmax of the action probabilities?
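The save/restore pattern I have in mind would look roughly like this (a sketch only; torch.randint stands in for the agent's stochastic action sampling):

```python
import torch

# End of sample collection: remember where the training RNG stream is.
saved_state = torch.get_rng_state()

# Deterministic backtest: reseed so every evaluation run is identical.
torch.manual_seed(0)
eval_draws = torch.randint(0, 3, (5,)).tolist()  # stand-in for sampled actions

# Before the next collection phase: resume the original RNG stream,
# so evaluation did not perturb the training randomness.
torch.set_rng_state(saved_state)
```

Whether this bookkeeping is actually worth it, versus just using argmax at evaluation time, is exactly my question.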

Thank you for your support!

Yes, you are correct that actions are chosen non-deterministically. When given an observation, the agent returns a policy, which is a probability distribution over all possible actions. An action is then selected by randomly sampling from this distribution. Here’s a brief answer about argmax vs. sampling.
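Concretely, the two action-selection modes look like this (a sketch with made-up logits, not any particular network's output):

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([1.0, 1.2, 0.9])  # hypothetical policy head output

dist = Categorical(logits=logits)
stochastic_action = dist.sample()         # can differ between calls
greedy_action = torch.argmax(dist.probs)  # deterministic: always the mode
```

Sampling is what gives you exploration during collection; argmax removes the run-to-run variation but always commits to the most probable action.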

It’s common practice to average the agent’s performance over many episodes. For example, if you scroll to figure 3 in Schulman et al., 2017, you’ll notice that each curve has “noise” around it, which indicates the range in performance over multiple runs. I’m not sure how manually seeding would help you, but you may find it helpful to check out RL papers that describe their evaluation procedure (e.g. Mnih et al., 2015).

Hi @ArchieGertsman,

Thank you for your prompt answer.

I’ve already read the referred Stack Overflow post + the PPO arXiv paper…now the DQN doc too (thanks).
Still I have some doubts; maybe I did not get the point and you could bring a bit of enlightenment for me :).

For me it seems as if they were evaluating the agent’s performance while the agent is still learning…at least they keep the exploration “on” during evaluation.

Why I am a bit confused with this is the following: let’s assume that an agent is trained in a simulation (epsilon-greedy or sampling is used to explore the environment). But once the agent is trained, it is deployed to a real environment where a random action would be risky (e.g. falling from a cliff, or losing money). After a lot of training, I guess the epsilon of the greedy policy will be very low, and .sample will maybe be very “stable” due to the action probabilities of the distribution. So these impacts will be low…but still there remains some randomness in the actions.

In such an “eval” case in real life, when a bad (random) move would mean “the end” for the agent, would such a random move still be justifiable during eval, instead of argmax?

Thank you!

You’re right, the figures often show how the reward increases over training iterations.

I’d argue that a deterministic action such as taking the argmax over the policy is still risky, as the learned agent will never be perfect. If there are any actions that you know to be catastrophic, then you can explicitly mask them out, instead of relying on the agent to assign them low probabilities.
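One common way to do this (a sketch, with hypothetical logits and mask) is to set the logits of forbidden actions to -inf before building the distribution, so their probability is exactly zero:

```python
import torch
from torch.distributions import Categorical

logits = torch.tensor([0.5, 1.5, 0.2, 1.0])            # hypothetical policy logits
forbidden = torch.tensor([False, True, False, False])   # e.g. "step off the cliff"

# -inf logits become zero probability after the softmax,
# so the masked action can never be sampled (and never wins argmax).
masked_logits = logits.masked_fill(forbidden, float("-inf"))
dist = Categorical(logits=masked_logits)
action = dist.sample()  # guaranteed not to be the forbidden action
```

This works for both sampling and argmax, since the forbidden action gets probability 0 either way.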

Hi @ArchieGertsman,

I’ve found the root cause…and now everything makes sense, including the arguments regarding the evaluation, so I thought I’d share it with you: I logged the final actions of the agent and how it performed over time, which seemed to be just fine…(except for the randomness I faced when the same backtest was performed several times with the same model params). What I did not do was log the action probabilities. So I had a look at them, and they were close to each other…or at least not too far apart. This caused the “randomness” when the actions were sampled from the distribution, and the argmax stabilized the outcome of the backtest (took out the randomness)…
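To demonstrate the effect (with made-up near-tied probabilities, similar in spirit to what I logged): when the distribution is almost uniform, sampling spreads the actions roughly evenly, while argmax always picks the same one.

```python
import torch
from torch.distributions import Categorical

# Hypothetical near-tied action probabilities.
probs = torch.tensor([0.34, 0.33, 0.33])
dist = Categorical(probs=probs)

torch.manual_seed(0)
samples = dist.sample((1000,))
counts = torch.bincount(samples, minlength=3)  # each action ~1/3 of the time

greedy = torch.argmax(probs).item()  # always action 0
```

With probabilities this close, each sampled backtest run takes a visibly different action sequence, which is exactly the run-to-run variation I was seeing.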

Thanks for your support!