Duplicate results due to lack of entropy using multinomial without replacement?

I’m trying to confirm that I’ve found a bug in multinomial without replacement. It seems to use only a single 32-bit random number as a ‘seed’ for its own random number generator when choosing all of the samples. That means it can generate at most 2^32 distinct permutations, even when far more are possible. I wrote a short app showing that multinomial without replacement always produces at least one DUPLICATE PERMUTATION within a few tens of thousands of iterations. Am I missing something?

Here’s the app:

import torch

# Uniform weights over the 26 letters; each draw samples a full permutation.
letters = [*"abcdefghijklmnopqrstuvwxyz"]
p = torch.tensor([1.0] * len(letters))
found = {}  # maps permutation string -> iteration it was first seen
for count in range(1, 1000000):
    t = torch.multinomial(p, num_samples=len(letters), replacement=False).tolist()
    result = ''.join(letters[n] for n in t)
    if result in found:
        print(f"We have generated {count:,} strings and the most recent one was '{result}'")
        print(f"This is the same as the {found[result]:,} generated string which was also '{result}'")
        break
    found[result] = count

The output looks like this:

We have generated 31,869 strings and the most recent one was 'kmfuqyesxvngithplacwbjdroz'
This is the same as the 1,375 generated string which was also 'kmfuqyesxvngithplacwbjdroz'
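
As a rough sanity check (a back-of-the-envelope estimate, separate from the app above): with 26! possible permutations a collision in a million draws should be essentially impossible, while with only 2^32 distinct outcomes the birthday problem predicts a first collision after roughly 80,000 draws.

import math

# Probability of any collision among n uniform draws from 26! permutations.
perms = math.factorial(26)      # ~4.0e26 possible orderings of 26 letters
n = 1_000_000
p_collision = -math.expm1(-n * (n - 1) / (2 * perms))
print(f"P(collision in 1e6 draws over 26! outcomes): {p_collision:.2g}")  # ~1e-15

# Expected number of draws until the first collision with only 2**32 outcomes.
states = 2 ** 32
print(f"Expected draws to first collision with 2**32 outcomes: "
      f"{math.sqrt(math.pi * states / 2):,.0f}")  # ~82,000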

Hi Jeff!

I can confirm your result, and I do believe that it is a bug.

I don’t believe your suggested explanation (that multinomial seeds its own
generator with a single 32-bit value) is the cause, although I don’t know
what the cause is: by default, torch.multinomial() uses PyTorch’s “global”
random-number generator, which on the CPU is (from memory) a Mersenne
Twister with a period vastly greater than 2**32.

For the same reason, I don’t think it’s true that at most 2**32 permutations
can be generated. I speculate (but have not confirmed) that multinomial()
will generate all permutations with non-zero probability, but does so
sufficiently non-uniformly that duplicates are generated much too frequently.

I don’t think that you are missing anything. I see the same thing
with a similar script.

I will start a new thread with the details to highlight this bug you’ve
found.

[Edit: This is the new thread.]

Best.

K. Frank

Thanks!

Here’s why I’m suspicious that multinomial without replacement is using only 32 bits of entropy. I looked at the output of the random number generator with a known seed, and it appears that multinomial without replacement consumes only a single 32-bit value from it, regardless of how many elements you ask it to sample.

Here’s an example. It first prints the output of rand with a known seed, then reseeds and shows how many values have been consumed after a multinomial call that draws 10 samples from 100 equally weighted categories without replacement.

import torch

# Print 16 values straight from the generator with a known seed, as a reference stream.
g = torch.Generator().manual_seed(2147483647)
print(torch.rand(16, generator=g))
print()

# Reseed, draw 10 of 100 without replacement, then print 16 more values
# to see how far the generator's stream has advanced.
g = torch.Generator().manual_seed(2147483647)
print(torch.multinomial(torch.tensor([1.] * 100), num_samples=10, replacement=False, generator=g))
print()
print(torch.rand(16, generator=g))

The output shows that only the first random value was consumed.

tensor([0.7081, 0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080,
        0.2908, 0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176])

tensor([32, 69, 47, 83, 10, 41, 37, 42, 72, 28])

tensor([0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887, 0.0080, 0.2908,
        0.7408, 0.4012, 0.8640, 0.7391, 0.3845, 0.2176, 0.7054])

I tried this with a double-precision probability tensor, and multinomial did consume two entries (64 bits of randomness), but even then my birthday-problem test still hits collisions as though the samples were generated from only 32 bits of entropy. My suspicion is that multinomial uses 32 bits from the supplied random number generator to seed its own internal generator, which then draws the samples.
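
For reference, the double-precision variant is essentially the example above with dtype=torch.float64 (a sketch of what I ran; printed tensors omitted):

import torch

# Reference stream from a known seed.
g = torch.Generator().manual_seed(2147483647)
print(torch.rand(16, generator=g))
print()

# Same draw as before, but with a float64 probability tensor.
g = torch.Generator().manual_seed(2147483647)
print(torch.multinomial(torch.tensor([1.] * 100, dtype=torch.float64),
                        num_samples=10, replacement=False, generator=g))
print()
# As noted above, this time the stream below starts at the *third* value of
# the reference stream, i.e. two 32-bit values (64 bits) were consumed.
print(torch.rand(16, generator=g))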

Hi Jeff!

Yes, I agree. I’ve replied in more detail in the “bug-report” thread.

By the way, this looks like a regression: the bug does not occur in version
1.11.0.

Best.

K. Frank