Calling torch.distributions.categorical.Categorical multiple times can affect the final result

Calling torch.distributions.categorical.Categorical multiple times can affect the final result.

In the PPO (Proximal Policy Optimization) algorithm, the action function is defined as follows:

def choose_action(self, s, evaluate=False):
    with torch.no_grad():
        s = torch.tensor(s, dtype=torch.float).unsqueeze(0)
        logits = self.ac.actor(s.cuda())
        if evaluate:
            a = torch.argmax(logits)
            return a.item(), None
        else:
            dist = Categorical(logits=logits)
            a = dist.sample()
            a_logprob = dist.log_prob(a)
            return a.item(), a_logprob.item()

In practice, I observe that calling Categorical.sample() many times beforehand changes the final result. For example:

import torch
from torch.distributions import Categorical
m = Categorical(torch.tensor([0.25, 0.25, 0.25, 0.25]))
for _ in range(10000):
    m.sample()

ppo = PPO()
env = Env(config, task='test')
evaluate(ppo, env)

# Output result: reward = 10
ppo = PPO()
env = Env(config, task='test')
evaluate(ppo, env)

# Output result: reward = 100

Hi Qingze!

I believe that you’re asking why calls to Categorical.sample() (and
things computed from them) aren’t reproducible.

This is to be expected as .sample() returns pseudorandom samples
from the Categorical distribution you instantiated. These samples
are supposed to be (randomly) different from one another. If you want
a series of .sample()s to be reproducible, you may reset the underlying
random-number generator with torch.manual_seed(). Consider:

>>> import torch
>>> torch.__version__
'2.2.2'
>>> _ = torch.manual_seed (2024)
>>> m = torch.distributions.Categorical(torch.tensor([0.25, 0.25, 0.25, 0.25]))
>>> m.sample(), m.sample()         # two samples
(tensor(2), tensor(1))
>>> m.sample(), m.sample()         # two more samples -- different result
(tensor(3), tensor(2))
>>> _ = torch.manual_seed (2024)   # reset random number generator
>>> m.sample(), m.sample()         # same two samples as at the beginning
(tensor(2), tensor(1))

Best.

K. Frank

I may have caused some confusion for you.

choose_action() is a member function of the PPO class. I declare a ppo object and call its choose_action() in the evaluate() function.
If I call .sample() many times before evaluate(), ppo gives a better reward.

I know that .sample() is not reproducible, but the problem is that .sample() may have memory: calling .sample() many times before calling choose_action() gives me a better reward.

I have set the seed with torch.manual_seed(seed).

Hi Qingze!

This is just* a (pseudo) statistical happenstance caused by the different,
but statistically-equivalent, (pseudo) random samples you get after
“advancing the random-number-generator state” by calling .sample()
a bunch of times (and discarding the results) before performing your
actual computation.

.sample() (or, more precisely, an instance of Categorical) does not
have memory. (Pytorch’s random-number generator does have state.
That state changes the specific values of the samples generated, but
not their statistical characteristics.)
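
As a minimal sketch (reusing the uniform four-way Categorical from your
example), the following shows that any “memory” lives in pytorch’s global
random-number generator, not in the distribution object: after re-seeding,
even a brand-new Categorical instance reproduces the same samples.

import torch
from torch.distributions import Categorical

probs = torch.tensor([0.25, 0.25, 0.25, 0.25])

torch.manual_seed(0)
m1 = Categorical(probs)
first = [m1.sample().item() for _ in range(5)]    # advances the global generator

torch.manual_seed(0)                              # reset the generator state
m2 = Categorical(probs)                           # a brand-new instance
second = [m2.sample().item() for _ in range(5)]

print(first == second)   # True -- the state is in the generator, not in Categorical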

Try this experiment, say ten to one hundred times. Start a new python
session and import pytorch. This initializes pytorch’s random-number
generator to a new, “random,” state. Either compute your reward or
call .sample() a bunch of times and then compute your reward. You
will find that there is no statistical difference between the rewards you
get with and without calling .sample() a bunch of times.
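
Here is a rough sketch of that experiment, run inside a single python session
by re-seeding “randomly” with torch.seed() for each trial. The toy_reward()
below is just a placeholder standing in for your evaluate (ppo, env), not your
actual code.

import torch
from torch.distributions import Categorical

probs = torch.tensor([0.25, 0.25, 0.25, 0.25])

def toy_reward(n=1000):
    # placeholder for evaluate (ppo, env): some number computed from samples
    m = Categorical(probs)
    return torch.stack([m.sample() for _ in range(n)]).float().mean().item()

def run_trial(burn):
    torch.seed()                    # fresh, non-deterministic state, like a new session
    if burn:
        m = Categorical(probs)
        for _ in range(10000):      # advance the generator and discard the samples
            m.sample()
    return toy_reward()

with_burn = [run_trial(True) for _ in range(100)]
without_burn = [run_trial(False) for _ in range(100)]

print(sum(with_burn) / 100, sum(without_burn) / 100)   # agree up to sampling noise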

*) There is some lore that the quality of the pseudorandom numbers
initially produced by pytorch’s random-number generator is reduced if
the seed has only low-order bits. So, if you want to be extra-careful about
this (I don’t bother.), you should use a seed that comes from basically
the whole range of a 64-bit integer.
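
If you do want such a seed, one way (purely as an illustration; pytorch does
not require python’s secrets module, it’s just a convenient source of a big
random integer) is:

import secrets
import torch

seed = secrets.randbits(63)     # a non-negative seed drawn from most of the 64-bit range
torch.manual_seed(seed)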

Best.

K. Frank.