Bug in torch.multinomial -- generated distribution is modestly incorrect [Edit: This is a regression and appears to be due to an analogous bug in Tensor.exponential_().]

Hi Forum!

@jeffc has found a bug in torch.multinomial() – see his thread
Duplicate results due to lack of entropy using multinomial without replacement?

(@ptrblck, if this looks like a bug to you, could you pass it on to the
powers that be? I haven’t filed this as a github issue because I still
haven’t set up a new github account.)

The concept is that you can use multinomial() without replacement
to generate random permutations and that these permutations should
be uniformly generated. (That is, any specific permutation of n
“letters” should be generated with probability 1 / n!.)

In @jeffc’s example. he generates permutations of 26 letters. Were
the permutations generated uniformly, it turns out that you would
have to generate about 10**13 permutations in order to have a 50%
chance of generating a duplicate, but duplicates show up in sets of
only tens of thousands of samples.

(I don’t have any intuition about what kind of flaw in multinomial()'s
distribution would lead to this effect. I also don’t see any coarse flaw
in multinomial()'s distribution, but somehow generating permutations
is picking up on some sort of minor fishiness.)

Here is a script that reproduces the bug @jeffc found. It also uses
randperm() to generate permutations and shows that these are
much more uniform.

import torch
print (torch.__version__)

high_bits_for_seed = 16000000000000000000           # to use "good quality" seed
_ = torch.manual_seed (high_bits_for_seed + 2024)

n = 26                                              # permutations of 26 "letters"
k = 1000000                                         # number of permutations to generate

print ('n:', n)
print ('k:', k)

# generate random permutations using multinomial() -- not quite uniform

prob = torch.ones (n)
dups_mult = 0
perm_counts_mult = {}
for _ in range (k):
    p = tuple (torch.multinomial (prob, n, replacement=False).tolist())
    if  p in perm_counts_mult:
        dups_mult += 1
        perm_counts_mult[p] += 1
    else:
        perm_counts_mult[p] = 1

print ('duplicate multinomial perms: ', dups_mult)
print ('multiple multinomial perms:  ', (torch.tensor (list (perm_counts_mult.values())) > 1).sum().item())
print ('max of perm_counts_mult:     ', torch.tensor (list (perm_counts_mult.values())).max().item())
print ('len (perm_counts_mult):      ', len (perm_counts_mult))

_ = torch.manual_seed (high_bits_for_seed + 2024)   # just to be consistent

# generate random permutations using randperm() -- much more uniform

dups_rand = 0
perm_counts_rand = {}
for _ in range (k):
    p = tuple (torch.randperm (n).tolist())
    if  p in perm_counts_rand:
        dups_rand += 1
        perm_counts_rand[p] += 1
    else:
        perm_counts_rand[p] = 1

print ('duplicate randperm perms:    ', dups_rand)
print ('multiple randperm perms:     ', (torch.tensor (list (perm_counts_rand.values())) > 1).sum().item())
print ('max of perm_counts_rand:     ', torch.tensor (list (perm_counts_rand.values())).max().item())
print ('len (perm_counts_rand):      ', len (perm_counts_rand))

And here is its output:

2.4.0
n: 26
k: 1000000
duplicate multinomial perms:  207
multiple multinomial perms:   207
max of perm_counts_mult:      2
len (perm_counts_mult):       999793
duplicate randperm perms:     0
multiple randperm perms:      0
max of perm_counts_rand:      1
len (perm_counts_rand):       1000000

I don’t really think that it’s the cause, but this multinomial() github issue
might be related. Note, passing the input probabilities as float64
(which would presumably cause multinomial() to process its
probability “buckets” in double precision) does not seem to help the
situation.

Best.

K. Frank

1 Like

My suspicion is that multinomial without replacement is using 32 bits of entropy from PyTorch’s global random number generator, and then using that as a seed for a separate random number algorithm that picks the samples. As a result, regardless of how many permutations are theoretically possible, there are at most only 2^32 permutations multinomial without replacement will ever generate for a given set of parameters (the same single 32 bit seed will yield the same result from multinomial without replacement regardless of the number of samples requested or the number possible values for each sample).

One way you can see how multinomial is using the same amount of entropy regardless of the number of samples/possibilities, is to run the above example with n=100 instead of n=26. You’ll see that you get the exact same number of duplicates (and if you checked, they’d occur at the exact same iteration).

Here’s an explicit example showing that with the seed from the above example, if you call multinomial without replacement for the 13,676th and 86,743rd random number in the sequence, it will generate the exact same results

import torch
high_bits_for_seed = 16000000000000000000
_ = torch.manual_seed (high_bits_for_seed + 2024)
torch.rand(13675)
print(torch.multinomial(torch.ones(100), 10))
torch.rand(86742-13675-1)
print(torch.multinomial(torch.ones(100), 10))

Here’s the output:

tensor([49, 15, 21, 31, 17, 24, 47, 20, 43, 28])
tensor([49, 15, 21, 31, 17, 24, 47, 20, 43, 28])

And it does so at the same iterations even if you change it to pick 20 numbers without replacement from 0-999 instead of 10 numbers from 0-99:

import torch
high_bits_for_seed = 16000000000000000000
_ = torch.manual_seed (high_bits_for_seed + 2024)
torch.rand(13675)
print(torch.multinomial(torch.ones(1000), 20))
torch.rand(86742-13675-1)
print(torch.multinomial(torch.ones(1000), 20))

Here are those results:

tensor([835, 696, 235, 854, 782, 468, 504, 831,  49, 508, 967, 242,  15, 110,
         21, 395, 729, 842, 224, 519])
tensor([835, 696, 235, 854, 782, 468, 504, 831,  49, 508, 967, 242,  15, 110,
         21, 395, 729, 842, 224, 519])

Even if you use a double precision probability distribution, it seems to only use 32 bits of entropy - It looks like it pulls of 64 bits of randomness from the global generator, but uses only the second 32 bits:

import torch
high_bits_for_seed = 16000000000000000000
_ = torch.manual_seed (high_bits_for_seed + 2024)
torch.rand(13675-1)
print(torch.multinomial(torch.ones(1000).double(), 20))
torch.rand(86742-13675-1-1)
print(torch.multinomial(torch.ones(1000).double(), 20))

The output is the same as the above example that used a float32 probability distribution:

tensor([835, 696, 235, 854, 782, 468, 504, 831,  49, 508, 967, 242,  15, 110,
         21, 395, 729, 842, 224, 519])
tensor([835, 696, 235, 854, 782, 468, 504, 831,  49, 508, 967, 242,  15, 110,
         21, 395, 729, 842, 224, 519])

Hi Jeff!

First, this looks like a regression. Here’s the results of running my
test script on version 1.11.0 (instead of the current stable, 2.4.0):

1.11.0
n: 26
k: 1000000
duplicate multinomial perms:  0
multiple multinomial perms:   0
max of perm_counts_mult:      1
len (perm_counts_mult):       1000000
duplicate randperm perms:     0
multiple randperm perms:      0
max of perm_counts_rand:      1
len (perm_counts_rand):       1000000

Second, I’ve traced this down to Tensor.exponential_(). This is
called by torch.multinomial() (at the ATen level), which displays
similar behavior.

Yes, I agree with this. The script you posted in your first thread makes
this clear.

Here is a tweaked version of that script that also runs .exponential_():

import torch
print (torch.__version__)

g = torch.Generator().manual_seed(2147483647)
print(torch.rand(16, generator=g))
print()

g = torch.Generator().manual_seed(2147483647)
print ('generate 10 multinomial samples:')
print(torch.multinomial(torch.tensor([1.]*100), num_samples=10,replacement=False, generator=g))
print()
print(torch.rand(16, generator=g))
print()

g = torch.Generator().manual_seed(2147483647)
print(torch.rand(16, generator=g))
print()

g = torch.Generator().manual_seed(2147483647)
print ('generate 10 exponential samples:')
print (torch.empty (10).exponential_ (generator=g))
print()
print(torch.rand(16, generator=g))

And here is its output, first on version 2.4.0:

2.4.0
tensor([0.7081, 0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080,
        0.2908, 0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176])

generate 10 multinomial samples:
tensor([32, 69, 47, 83, 10, 41, 37, 42, 72, 28])

tensor([0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080, 0.2908,
        0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176, 0.7054])

tensor([0.7081, 0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080,
        0.2908, 0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176])

generate 10 exponential samples:
tensor([0.2398, 0.5397, 0.3355, 0.2256, 0.2360, 0.8463, 3.0238, 0.5873, 0.8899,
        1.5785])

tensor([0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080, 0.2908,
        0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176, 0.7054])

And then on 1.11.0:

1.11.0
tensor([0.7081, 0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080,
        0.2908, 0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176])

generate 10 multinomial samples:
tensor([78, 83, 32, 44,  3,  4, 20,  7, 84, 90])

tensor([0.2097, 0.1018, 0.2837, 0.3573, 0.7845, 0.4066, 0.3308, 0.6559, 0.3364,
        0.6619, 0.5475, 0.3593, 0.5875, 0.6502, 0.3760, 0.2517])

tensor([0.7081, 0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080,
        0.2908, 0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176])

generate 10 exponential samples:
tensor([1.0937, 1.8544, 1.2857, 0.0597, 0.0664, 2.6050, 2.4298, 0.0789, 1.0302,
        1.1383])

tensor([0.7074, 0.0554, 0.8117, 0.3059, 0.9928, 0.3419, 0.2070, 0.6071, 0.8581,
        0.3224, 0.5998, 0.1621, 0.3729, 0.5870, 0.3176, 0.3409])

I still doubt that multinomial() is seeding its own random-number
generator – that’s just not how pytorch generally works – but I
haven’t verified this.

I am sort of speculating that multinomial() (or, under the hood,
.exponential_()) is generating a single 32-bit random variate
(with a 23-bit mantissa) and somehow using that to generate all
the samples produced by single call (but I haven’t been able to
find the actual .exponential_() implementation to see what is
really going on).

Yes, I agree.

As an aside, the state of the (global) generator is not the same at
the 13,676th and 86,743rd positions. It’s just that the 32-bit random
value (whether packaged as an int or float) “happens” to be the
same.

Best.

K. Frank

That’s really interesting that it is exponential_ which seems to be the root cause, and multinomial surfaces the bug just because it has a dependency on exponential_. It definitely broadens the scope of code that might be impacted.

I am not sure, but I think the bug may originate with the C++ implementation of exponential_ which may be in this file - pytorch/aten/src/ATen/native/cpu/DistributionKernels.cpp at 4c2bcf92cbecd36b7881904bceb8dc50c9b9741d · pytorch/pytorch · GitHub

The function exponential_kernel() seems to get a seed from the PyTorch random number generator, then use it as the seed to a different random number generator.

void exponential_kernel(TensorIteratorBase &iter, double lambda, std::optional<Generator> gen) {
...
        seed = generator->random();
...
            vslNewStream(&stream, VSL_BRNG_MCG31, seed);
            vslSkipAheadStream(stream, begin);
            vdRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, len,
              (double *)(sample_ptr + begin), eps, 1./lambda);
            vslDeleteStream(&stream);
...

This means that exponential_ has only a limited set of 2^32 different series it will generate when you ask it to generate more than one value. You can see this by calling exponential with different counts at the point where the PyTorch random number generator has the same 32 bit value:

import torch
high_bits_for_seed = 16000000000000000000
_ = torch.manual_seed (high_bits_for_seed + 2024)
torch.rand(13675)
print(torch.empty(10).exponential_())
torch.rand(86742-13675-1)
print(torch.empty(20).exponential_())

Here are the results:

tensor([0.9412, 0.2365, 1.2177, 0.2239, 2.1906, 1.0692, 0.1377, 1.4104, 1.2070,
        0.2321])
tensor([0.9412, 0.2365, 1.2177, 0.2239, 2.1906, 1.0692, 0.1377, 1.4104, 1.2070,
        0.2321, 2.2380, 0.3541, 1.4934, 0.9233, 0.7194, 0.0163, 0.2370, 0.0354,
        1.6427, 0.2042])

This means surprisingly that “0.9412, 0.2365, 1.2177, 0.2239, …” is one of 2^32 ways that exponential_ will always start.

I also noticed that bernoulli_scalar_kernel() in the same file also uses a seed to initialize vslNewStream(). Sure enough, it seems to suffer from the same bug. Altering the above to generate 50 ‘0’ or ‘1’ values using the bernoulli method, you see it also generates the same results:

import torch
high_bits_for_seed = 16000000000000000000
_ = torch.manual_seed (high_bits_for_seed + 2024)
torch.rand(13675)
print(torch.empty(50).bernoulli(0.5))
torch.rand(86742-13675-1)
print(torch.empty(50).bernoulli(0.5))

Here is the output:

tensor([1., 0., 1., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0.,
        1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1.,
        1., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0.])
tensor([1., 0., 1., 0., 1., 1., 0., 1., 1., 0., 1., 0., 1., 1., 1., 0., 0., 0.,
        1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 0., 1., 0., 0., 1., 0., 1.,
        1., 0., 1., 0., 1., 1., 1., 0., 1., 0., 1., 0., 1., 0.])

For bernoulli - It appears that this issue has been in the code for at least three years (it was in the initial version of the file). For exponential_, it seems to have been added with this commit:

I don’t understand the codebase enough to know if the timelines where the issue was introduced varies on GPU vs CPU.

I also didn’t see other uses of vslNewStream which might indicate places where a value from one random number generator might have been used as a seed for another, but this is definitely a pattern that should be looked for and eliminated.

Also here is a small bug that hopefully will go away when the issue is fixed. exponential_kernel() tries to initialize a 32 bit or 64 bit seed based on the precision of the tensor, but vslNewStream looks like it takes a 32 bit seed. This explains why you get the same results for float32 and double tensors:

...
    int64_t seed;
    {
      // See Note [Acquire lock when using random generators]
      std::lock_guard<std::mutex> lock(generator->mutex_);
      if (self.scalar_type() == at::kDouble)
        seed = generator->random64();
      else
        seed = generator->random();
    }
...
          if constexpr (std::is_same<scalar_t, double>::value) {
            vslNewStream(&stream, VSL_BRNG_MCG31, seed);
            vslSkipAheadStream(stream, begin);
            vdRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, len,
              (double *)(sample_ptr + begin), eps, 1./lambda);
            vslDeleteStream(&stream);
          } else {
            vslNewStream(&stream, VSL_BRNG_MCG31, seed);
            vslSkipAheadStream(stream, begin);
            vsRngExponential(VSL_RNG_METHOD_EXPONENTIAL_ICDF, stream, len,
              (float *) (sample_ptr + begin), eps, 1./lambda);
            vslDeleteStream(&stream);
          }

Here are the docs for vslNewStream - https://www.intel.com/content/www/us/en/docs/onemkl/developer-reference-c/2023-2/vslnewstream.html

Hi Jeff!

I believe that you are correct. Great detective work! I wasn’t able to
drill down through all the layers to get to the actual implementation.

Yes. (It turns out that you were right about this all along …)

This strikes me as a big problem. As a general rule – and this has
been known since before I was born (Yikes!) – it’s a bad idea to use
one pseudo-random-number generator to seed another.

Yes. This strikes me as a big problem. By today’s standards, 2**32 is
a “small” number. One could easily exhaust such a “small” set of
pseudo-random variates with – by today’s standards – a medium-scale
computation (that I could run on my cell phone …).

You can use torch.random.get_rng_state() to get the state of
pytorch’s (cpu) global generator. It’s a ByteTensor of length 5056,
which is to say the state consists of 632 32-bit words. (This would
be typical of a 32-bit mersenne-twister generator, which I’m pretty
sure is what pytorch uses on the cpu.)

So .exponential_() (and bernoulli, etc.) take that 632-word state
and pump it through a bottleneck of just a single word. You’ve thrown
away almost all of the state that made the mersenne twister a good
generator and you can’t get that state back (even if the generator
you seed with that single word is a good generator). This really goes
against everything we know (and that we’ve learned the hard way)
about good practice when using pseudo-random generators.

Part of what makes these kinds of bugs so pernicious is that they
don’t obviously catch your eye nor show up in quick tests. In your
case, you had to look for collisions of random permutations derived
from multinomial() when generating tens of thousand of samples.

I agree completely. I understand that the need for speed is a competing
consideration, but not at the expense of correctness. Although I’m not
a fan of seeding one generator with another, there are legitimate use
cases. But if you’re going to do so, you have to be much more careful.
(And squeezing your generator through a 32-bit “entropy bottleneck”
doesn’t meet the test of adequate care.)

Thanks very much for finding this.

Best.

K. Frank

Thanks for the tag and the great explanation and debugging, @KFrank and @jeffc!

This sounds like a bug indeed and should be fixed.
I’ve created a GitHub issue here so we can track it with the code owners.

Wow! A really bright end eagle eyed colleague of mind noticed that vslNewStream is initialized with VSL_BRNG_MCG31. This means there are not just a limited number of possible streams that exponential_ will generate, but there is effectively a single stream of (2^31-1) values, and exponential_ just returns some section of that one stream.

Here’s what happens:

  • There exists only one series of about 2 billion possible results.
  • When you ask exponential_() to give you N random numbers from the exponential distribution, the current Python PRNG is used to pick a 32 bit seed which chooses a starting location in the sequence of 2 billion results.
  • You then get the next N items from that sequence.

This means any sequence returned from exponential_ is just a block of N sequential elements from the larger series of 2 billion. You’ll always see the same exact series of numbers generated in the same exact order.

Here’s a sample that demonstrates this by generating 100 numbers from the distribution. It then sits in a loop generating about 2 million numbers at a time. I believe the odds that each sequence generated will contain the initial sequence of 100 is about 1 in 1024. Sure enough, this always finds a hit exactly as expected (pardon my brute force and hacky implementation of finding a tensor in another tensor, but I’m new to Python and PyTorch as of last week)

import torch
sample = torch.empty(100).exponential_();
print(sample.tolist())
firstValue = sample[0].item()
count = 0
testSize = 1024*1024*2
print(f'Testing by filling tensor of size {testSize:,} with exponential_().')
print(f'Probability of initial sequence showing up in generated tensor about 1 in {2147483648.0/testSize}')
while True:
    count += 1
    testTensor = torch.empty(testSize).exponential_()
    allFound = True
    for s in sample:
        if not s in testTensor:
            allFound = False
            break

    if not allFound:
        continue

    print(f'checking possible hit at test {count:,}.')
    test = testTensor.tolist()
    indices = [i for i in range(len(test)) if test[i] == firstValue]
    found = False
    for i in indices:
        if(i <= (len(test)-100) and test[i] == sample[0].item() and test[i+1] == sample[1].item() and test[i+2] == sample[2].item() and test[i+3] == sample[3].item()):
            print(f'Found initial sequence in new tensor after {count:,} attempts.')
            print(f'At index {i:,} of new tensor found this series:')
            print(test[i:i+100])
            found = True
            break
    if found:
        break

Here are example results (truncated a bit)

[1.5481778383255005, 1.0027738809585571, 1.3017827272415161, 1.2531036138534546, 0.6990818381309509, 0.6651406288146973, 1.7746243476867676, 1.1834914684295654, 0.21791832149028778, 0.6379857659339905, 0.11356551200151443, ...
Testing by filling tensor of size 2,097,152 with exponential_().
Probability of initial sequence showing up in generated tensor about 1 in 1024.0
checking possible hit at test 889.
Found initial sequence in new tensor after 889 attempts.
At index 1,157,902 of new tensor found this series:
[1.5481778383255005, 1.0027738809585571, 1.3017827272415161, 1.2531036138534546, 0.6990818381309509, 0.6651406288146973, 1.7746243476867676, 1.1834914684295654, 0.21791832149028778, 0.6379857659339905, 0.11356551200151443, ...

The initial tensor of 100 elements from exponential_ was found in the results after 889 subsequent calls to exponential_ where each call asked for tensor of 2,097,152 elements from exponential_

As you can see, every result from exponential_ is just part of the same long list of 2 billion pre-determined values.

bernoulli, of course, suffers from the same use of VSL_BRNG_MCG31.

Jeff

Hi Jeff!

Good find!

This just seems perverse. Why, in today’s world, would somebody
use by default a generator with a period of “only” 2^31?

It looks like there is some progress. This github commit supposedly
turns off “MKL” (and VSL_BRNG_MCG31) until a long-term solution
is decided upon. (This commit is referenced in the associated
github issue.)

(Just to note, this patch seems not yet to have hit the current nightly,
2.5.0.dev20240803.)

I don’t know the details of how these things work, so this is just off the
top of my head:

It appears that MKL also has a mersenne-twister generator,
VSL_BRNG_MT19937. If performance of pytorch’s current mersenne
twister (used on the cpu) is an issue and MKL is faster, perhaps the
long-term solution would be simply to elevate VSL_BRNG_MT19937
to be pytorch’s global (cpu) generator.

Things like .exponential_() (as well as all other random-number
consumers) would get the desired performance gain, you would have
the same quality of random numbers as now, and you would no longer
have the issue of seeding a “local” generator for .exponential_()
with the global generator.

(While nowhere near as bad as using VSL_BRNG_MCG31, seeding
a local mersenne twister with the global mersenne twister would still
not be a best practice – just use the global generator directly for things
like .exponential_() with the expectation that VSL_BRNG_MT19937
would address any performance issues.)

Best.

K. Frank

P.S. Doing our part to save NVIDIA from its own 21st-century
RANDU fiasco. Or to quote Donald Knuth, “its very name RANDU is
enough to bring dismay into the eyes and stomachs of many computer
scientists!”