Odd result using multinomial num_samples 1 vs larger sizes

John_Elrick · March 18, 2023, 1:52pm

While following Andrej Karpathy’s video on makemore, I ran into a puzzling result from the call to multinomial.

TL;DR: Using a fixed seed, I am getting a different result for result at position [0] when num_samples is 1 than with any other value. I would expect that the result would be identical. See below code example with output. Is this expected?

Output:

Python version: 3.8.10 (default, Nov 14 2022, 12:59:47) 
[GCC 9.4.0]
PyTorch version: 2.0.0+cu117
tensor([0.0000, 0.1377, 0.1784, 0.2266, 0.2793, 0.3271, 0.3401, 0.3610, 0.3883,
        0.4068, 0.4824, 0.5749, 0.6240, 0.7032, 0.7390, 0.7513, 0.7673, 0.7702,
        0.8214, 0.8855, 0.9264, 0.9288, 0.9405, 0.9501, 0.9543, 0.9710, 1.0000])
With num_samples = 1
Rand tensor([0.7081])
P tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,
        0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,
        0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])
tensor([10])
Cumulative 0.4823775589466095
Value 0.07560952752828598

With num_samples = 2
Rand tensor([0.7081, 0.3542])
P tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,
        0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,
        0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])
tensor([13, 19])
Cumulative 0.7031810879707336
Value 0.07923079282045364

With num_samples = 3
Rand tensor([0.7081, 0.3542, 0.1054])
P tensor([0.0000, 0.1377, 0.0408, 0.0481, 0.0528, 0.0478, 0.0130, 0.0209, 0.0273,
        0.0184, 0.0756, 0.0925, 0.0491, 0.0792, 0.0358, 0.0123, 0.0161, 0.0029,
        0.0512, 0.0642, 0.0408, 0.0024, 0.0117, 0.0096, 0.0042, 0.0167, 0.0290])
tensor([13, 19, 14])
Cumulative 0.7031810879707336
Value 0.07923079282045364

Sample code:

import sys
import torch
print(f"Python version: {sys.version}")
print(f"PyTorch version: {torch.__version__}")

# This is a simplified version of the code example from makemore
# I left it as close as possible to the scenario I detected this anomaly
N = torch.zeros((1, 27), dtype = torch.int32)

values = list([0, 4410, 1306, 1542, 1690, 1531,  417,  669,  874,  591, 2422, 2963,
        1572, 2538, 1146,  394,  515,   92, 1639, 2055, 1308,   78,  376,  307,
         134,  535,  929])
for i,x in enumerate(values):
    N[0, i] = x
p = N[0].float()
p = p / p.sum()

# This is a shorthand cumulative function to show where the breakpoints "should" be
cp = torch.cumsum(p, dim=0)
print(cp)

# This function ensures there is no variation based on the scenario being executed
def sanity_check(num_samples):
    print(f"With num_samples = {num_samples}")
    g = torch.Generator().manual_seed(2147483647)
    print(f"Rand {torch.rand(num_samples, generator = g)}")
    print(f"P {p}")
    g = torch.Generator().manual_seed(2147483647)
    output = torch.multinomial(p, num_samples=num_samples, replacement=True, generator=g)
    ix = output[0]
    print(output)
    print(f"Cumulative {cp[ix]}")
    print(f"Value {p[ix]}")
    print()

sanity_check(1)
sanity_check(2)
sanity_check(3)
# Other tests are omitted. All produced the same result for [0]: 13. Only sanity_check(1) produces 10.

ptrblck · March 19, 2023, 12:09am

This might be expected as different code paths could be taken for tensors using a different shape (e.g. vectorized code paths could be taken in the input tensor matches the requirements etc.).
Do you see wrong results in any of these paths?

John_Elrick · March 19, 2023, 12:57am

[NOTE: I started on punch cards in 1981 and have written well over a million lines of application code in my career.)

I’m seeing the exact same seed number producing the exact same output from the generator for any values I tested (I actually used 2,3,4,5,10,20 - all produced the correct [according to the video] output); 13.

However, when I use num_samples = 1, the generator produces the exact same output (0.7081), but that results in a result of index 10 instead of 13 which, as can be seen in the cumulative tensor, should be a value between 0.4824 and 0.5749.

I just can’t determine what could be going on.

ptrblck · March 19, 2023, 5:30am

Again, different code paths might be used and there is no guarantee that all different sequence length will produce exactly the same output samples from the PRNG.
Each sequence for itself will produce the same output after seeding, but you can see the same effect for e.g. torch.randn:

for a in range(1, 18):
    torch.manual_seed(2809)
    print("size {}\notuput {}".format(a, torch.randn(a)))

Output:

size 1
otuput tensor([-2.0748])
size 2
otuput tensor([-2.0748,  0.8152])
size 3
otuput tensor([-2.0748,  0.8152, -1.1281])
size 4
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386])
size 5
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471])
size 6
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538])
size 7
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776])
size 8
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635])
size 9
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635,
         0.5434])
size 10
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635,
         0.5434, -0.8192])
size 11
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635,
         0.5434, -0.8192,  0.6980])
size 12
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635,
         0.5434, -0.8192,  0.6980, -0.2340])
size 13
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635,
         0.5434, -0.8192,  0.6980, -0.2340, -0.6336])
size 14
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635,
         0.5434, -0.8192,  0.6980, -0.2340, -0.6336, -0.7070])
size 15
otuput tensor([-2.0748,  0.8152, -1.1281,  0.8386, -0.4471, -0.5538, -0.8776, -0.5635,
         0.5434, -0.8192,  0.6980, -0.2340, -0.6336, -0.7070,  1.1824])
size 16
otuput tensor([-1.0187e+00,  4.7230e-01,  2.0276e-01,  4.4370e-04,  4.7340e-01,
        -1.5490e+00,  1.4753e+00,  2.8391e-01,  2.9432e-01, -6.6537e-02,
        -1.1419e+00, -4.5612e-01,  1.4221e+00,  8.4476e-01,  5.0450e-01,
        -1.6921e+00])
size 17
otuput tensor([-1.0187, -0.6506,  0.2364, -1.3451, -1.0817,  0.5138, -0.1689, -0.8990,
         1.1120, -0.7482, -0.8071,  1.0999, -1.2457, -1.2504, -0.5546,  0.3743,
         2.2206])

As you can see, the outputs from 1-15 match while starting from 16 another (probably vectorized) code path will be used.

EDIT: in case you want to double check the implementation, you could have a look at this code which should be used for the multinomial kernel on the CPU.

KFrank · March 19, 2023, 7:04am

Hi John!

The expectation that the result be identical is not fully unreasonable. It is
not, however, guaranteed, as @ptrblck explains and illustrates with his
randn() example.

Some further comments:

It does seem that for pytorch version 2.0.0 num_samples = 1 is a special
(and broken-in-performance) case. (See the test code, below.)

I reproduce your results when I run the code you posted on pytorch
version 2.0.0. However, I get the “expected” result (namely, that we
get the same result for position [0] regardless of whether num_samples
is equal to 1) when I run your code on version 1.13.1. Furthermore, a
rather egregious performance bug shows up on 2.0.0.

I performed the following test: I either sample num_samples values
from a multinomial distribution with a single call to torch.multinomial()
or I use a loop to make a single-value call to torch.multinomial()
num_samples times or I use a loop to make a two-value call to
torch.multinomial() num_samples // 2 times.

On 1.13.1 the three approaches yield the same result, while on 2.0.0 the
single-value approach differs significantly.

This is not necessarily an error, because all you can expect is that the
various approaches yield the same statistical distribution of samples,
rather than the same set of sampled values.

(A simple check of the frequency of sampled values looks correct in all
cases, but this doesn’t verify the required statistical independence of the
samples.)

The multi-sample and two-sample calls on 2.0.0, and the multi-sample,
single-sample, and two-sample calls on 1.13.1 all give the exact same
result in my test, while the single-sample-call loop on 2.0.0 gives a
different result (that appears to be statistically valid). Furthermore, the
single-sample-call loop on 2.0.0 appears to consume a different number
of random numbers than the other five computations* and is egregiously
slow.

As expected, the loop approaches are significantly slower than the single
multi-sample calls.

However, the single-sample-loop takes about ten times as long on 2.0.0
as it does on 1.13.1, so something clearly got broken in 2.0.0 in terms of
performance.

Here is the test script:

import torch
print (torch.__version__)

import time

p = torch.tensor ([[1.0 / 6.0, 2.0 / 6.0, 3.0 / 6.0]])

cp = torch.cumsum (p, dim = 1)
print ('cp:', cp)

num_samples = 6000000
print ('num_samples:', num_samples)

g = torch.Generator().manual_seed (2147483647)
print ('rand check:', torch.rand (8, generator = g))
print ('rand check:', torch.rand (1, generator = g))

print ('multi-sample ...')
g = torch.Generator().manual_seed (2147483647)
t0 = time.time()
outputA = torch.multinomial (p, num_samples = num_samples, replacement = True, generator = g)
t1 = time.time()
print ('multi-sample time:', t1 - t0)
print ('post-multi-sample rand:', torch.rand (1, generator = g))

print ('single-sample loop...')
g = torch.Generator().manual_seed (2147483647)
t0 = time.time()
outlist = []
for  i in range (num_samples):
    outlist.append (torch.multinomial (p, num_samples = 1, replacement = True, generator = g))

t1 = time.time()
print ('single-sample-loop time:', t1 - t0)
outputB = torch.cat (outlist, dim = 1)
print ('post-single-sample-loop rand:', torch.rand (1, generator = g))

print ('two-sample loop...')
g = torch.Generator().manual_seed (2147483647)
t0 = time.time()
outlist = []
for  i in range (num_samples // 2):
    outlist.append (torch.multinomial (p, num_samples = 2, replacement = True, generator = g))

t1 = time.time()
print ('two-sample-loop time:', t1 - t0)
outputC = torch.cat (outlist, dim = 1)
print ('post-two-sample-loop rand:', torch.rand (1, generator = g))

print ('torch.equal (outputA, outputB):', torch.equal (outputA, outputB))
print ('torch.equal (outputA, outputC):', torch.equal (outputA, outputC))

print ('multi-sample       (outputA) value counts:', torch.eq (outputA, torch.tensor ([[0, 1, 2]]).T.expand (3, num_samples)).sum (dim = 1))
print ('single-sample-loop (outputB) value counts:', torch.eq (outputB, torch.tensor ([[0, 1, 2]]).T.expand (3, num_samples)).sum (dim = 1))
print ('two-sample-loop    (outputC) value counts:', torch.eq (outputC, torch.tensor ([[0, 1, 2]]).T.expand (3, num_samples)).sum (dim = 1))

Here is the output from version 1.13.1:

1.13.1
cp: tensor([[0.1667, 0.5000, 1.0000]])
num_samples: 6000000
rand check: tensor([0.7081, 0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887])
rand check: tensor([0.0080])
multi-sample ...
multi-sample time: 0.12267303466796875
post-multi-sample rand: tensor([0.3459])
single-sample loop...
single-sample-loop time: 32.80804705619812
post-single-sample-loop rand: tensor([0.3459])
two-sample loop...
two-sample-loop time: 19.045663595199585
post-two-sample-loop rand: tensor([0.3459])
torch.equal (outputA, outputB): True
torch.equal (outputA, outputC): True
multi-sample       (outputA) value counts: tensor([1001254, 1999553, 2999193])
single-sample-loop (outputB) value counts: tensor([1001254, 1999553, 2999193])
two-sample-loop    (outputC) value counts: tensor([1001254, 1999553, 2999193])

And here is the 2.0.0 output:

2.0.0
cp: tensor([[0.1667, 0.5000, 1.0000]])
num_samples: 6000000
rand check: tensor([0.7081, 0.3542, 0.1054, 0.5996, 0.0904, 0.0899, 0.8822, 0.9887])
rand check: tensor([0.0080])
multi-sample ...
multi-sample time: 0.11531782150268555
post-multi-sample rand: tensor([0.3459])
single-sample loop...
single-sample-loop time: 307.42153882980347
post-single-sample-loop rand: tensor([0.6761])
two-sample loop...
two-sample-loop time: 20.270005226135254
post-two-sample-loop rand: tensor([0.3459])
torch.equal (outputA, outputB): False
torch.equal (outputA, outputC): True
multi-sample       (outputA) value counts: tensor([1001254, 1999553, 2999193])
single-sample-loop (outputB) value counts: tensor([ 999666, 2000187, 3000147])
two-sample-loop    (outputC) value counts: tensor([1001254, 1999553, 2999193])

*) I haven’t tried to determine why there is a difference in the number of
random numbers consumed by the single-sample approach. Perhaps the
same thing that is causing it to run so slowly when num_samples = 1 is
also causing it to consume (and waste) extra random samples.

Best.

K. Frank

John_Elrick · March 19, 2023, 12:48pm

Thanks to both of you!

ptrbick - I’ll take a glance at the implementation code, however, your explanation makes sense.

KFrank - your replication coupled with a tie in to a new major version of pytorch was perfect. Given that the goal of the random number generator is to, well, produce pseudo-random numbers and not necessarily to achieve perfect replication with both the same seed and different run lengths, a change which causes this type of result is understandable.

As the videos from Karpathy were published only 6 months ago, I hadn’t considered he might be using an earlier version of PyTorch. My concern was that I was getting a different result than he was for the case of num_samples = 1 and wanted to make sure I wasn’t doing something obviously stupid (not that’s every happened to me in the last 42 years).

Thanks again to both of you for clearing up the mystery!

ptrblck · March 19, 2023, 8:59pm

Sure and thank you for providing providing the code and your concerns in the first place!
While changes in the sequence length could dispatch to faster code paths, @KFrank isolated a concerning issue which shows the single-sample-loop approach performs quite poorly, so I’ll take a look at its implementation and the root cause (as this issue wasn’t seen in 1.13.1).
So thanks to both of you!

jaikb · March 23, 2023, 4:31am

Wow this is crazy. I was also just going through Karpathy’s tutorial and came across this exact same issue! When I downgraded my PyTorch version to 1.13.1 and my torchvision to 0.14.1 (for anyone wanting to do the same), it fixed it. So just wanted to say thank you to @John_Elrick for raising this issue and for KFrank for doing that investigation. @ptrblck where should I be checking to know how this issue was caused and when you’ve looked at it?

KFrank · March 23, 2023, 6:00pm

Hi All!

A couple more data points on multinomial()'s consumption of random
numbers:

First, multinomial() can be run in a single-precision or double-precision
mode by passing in an input probability tensor of type float and double,
respectively. (In either case, the multinomial samples returned are of type
long.)

Using pytorch version 1.13.1, both single- and double-precision multinomial
making both single-sample and multiple-sample calls consume the same
number of rands (two per sample) and return the same sample values.

The multiple-sample calls on version 2.0.0 behave the same way.

However, single-precision, single-sample calls on 2.0.0 consume only
one rand per sample (while double-precision, single-sample calls
consume two).

Also note that in my test on 2.0.0, the double-precision, single-sample calls
return every other of the samples that the single-precision, single-sample
calls return. (This may be an artifact of the fact that the samples are
discrete and that it is unlikely that a single-precision value generated from
a single rand would fall close enough to the border of a probability bin
that the nearly-equal double-precision value generated from two rands
would fall instead in the neighboring bin.)

Last, the 2.0.0 single-sample performance bug affects both the
single-precision and double-precision calls, in that both are about
ten times slower than the analogous calls on 1.13.1.

Here is a test script and its 1.13.1 and 2.0.0 outputs:

import torch
print (torch.__version__)

import time

seed = 2023
p = torch.ones (100)

n = 8
nBig = 1000000

g = torch.Generator().manual_seed (seed)
rnds = torch.rand (64, generator = g)   # to see how many rands are consumed
print ('rnds:', rnds)

for  dtype in [torch.float, torch.double]:
    q = p.to (dtype)
    print ('multi-sample, dtype:', dtype)
    for  i in range (1, n + 1):
        g = torch.Generator().manual_seed (seed)
        mlt = torch.multinomial (q, num_samples = i, replacement = True, generator = g)
        r = torch.rand (1, generator = g)
        print ('i:', i, '  r:', r, '  (r == rnds).nonzero():', (r == rnds).nonzero())
    if  dtype == torch.float:
        mlt_multi_float = mlt.tolist()
    else:
        mlt_multi_double = mlt.tolist()
    g = torch.Generator().manual_seed (seed)
    t0 = time.time()
    mlt = mlt = torch.multinomial (q, num_samples = nBig, replacement = True, generator = g)
    print ('nBig:', nBig, '  time =', time.time() - t0)
    print ('single-sample, dtype:', dtype)
    for  i in range (1, n + 1):
        g = torch.Generator().manual_seed (seed)
        mlt_single = []
        for  j in range (i):
            mlt = torch.multinomial (q, num_samples = 1, replacement = True, generator = g)
            mlt_single.append (mlt.item())
        r = torch.rand (1, generator = g)
        print ('i:', i, '  r:', r, '  (r == rnds).nonzero():', (r == rnds).nonzero())
    if  dtype == torch.float:
        mlt_single_float = mlt_single
    else:
        mlt_single_double = mlt_single
    g = torch.Generator().manual_seed (seed)
    t0 = time.time()
    for  j in range (nBig):
        mlt = torch.multinomial (q, num_samples = 1, replacement = True, generator = g)
    print ('nBig:', nBig, '  time =', time.time() - t0)

print ('mlt_multi_float:    ', mlt_multi_float)
print ('mlt_multi_double:   ', mlt_multi_double)
print ('mlt_single_float:   ', mlt_single_float)
print ('mlt_single_double:  ', mlt_single_double)

1.13.1
rnds: tensor([0.4290, 0.7201, 0.9481, 0.4797, 0.5414, 0.9906, 0.4086, 0.2183, 0.1834,
        0.2852, 0.7813, 0.1048, 0.6550, 0.8375, 0.1823, 0.5239, 0.2432, 0.9644,
        0.5034, 0.0320, 0.8316, 0.3807, 0.3539, 0.2114, 0.9839, 0.6632, 0.7001,
        0.0155, 0.3840, 0.7968, 0.4917, 0.4324, 0.5174, 0.6913, 0.1628, 0.5692,
        0.0938, 0.3054, 0.1259, 0.7719, 0.6046, 0.9558, 0.0861, 0.7213, 0.0747,
        0.0035, 0.4003, 0.5900, 0.1179, 0.7419, 0.6117, 0.9598, 0.4614, 0.9241,
        0.1411, 0.9587, 0.9025, 0.5650, 0.2162, 0.9765, 0.6009, 0.5165, 0.3156,
        0.3671])
multi-sample, dtype: torch.float32
i: 1   r: tensor([0.9481])   (r == rnds).nonzero(): tensor([[2]])
i: 2   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 3   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 4   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
i: 5   r: tensor([0.7813])   (r == rnds).nonzero(): tensor([[10]])
i: 6   r: tensor([0.6550])   (r == rnds).nonzero(): tensor([[12]])
i: 7   r: tensor([0.1823])   (r == rnds).nonzero(): tensor([[14]])
i: 8   r: tensor([0.2432])   (r == rnds).nonzero(): tensor([[16]])
nBig: 1000000   time = 0.04687786102294922
single-sample, dtype: torch.float32
i: 1   r: tensor([0.9481])   (r == rnds).nonzero(): tensor([[2]])
i: 2   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 3   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 4   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
i: 5   r: tensor([0.7813])   (r == rnds).nonzero(): tensor([[10]])
i: 6   r: tensor([0.6550])   (r == rnds).nonzero(): tensor([[12]])
i: 7   r: tensor([0.1823])   (r == rnds).nonzero(): tensor([[14]])
i: 8   r: tensor([0.2432])   (r == rnds).nonzero(): tensor([[16]])
nBig: 1000000   time = 4.480212926864624
multi-sample, dtype: torch.float64
i: 1   r: tensor([0.9481])   (r == rnds).nonzero(): tensor([[2]])
i: 2   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 3   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 4   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
i: 5   r: tensor([0.7813])   (r == rnds).nonzero(): tensor([[10]])
i: 6   r: tensor([0.6550])   (r == rnds).nonzero(): tensor([[12]])
i: 7   r: tensor([0.1823])   (r == rnds).nonzero(): tensor([[14]])
i: 8   r: tensor([0.2432])   (r == rnds).nonzero(): tensor([[16]])
nBig: 1000000   time = 0.050467491149902344
single-sample, dtype: torch.float64
i: 1   r: tensor([0.9481])   (r == rnds).nonzero(): tensor([[2]])
i: 2   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 3   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 4   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
i: 5   r: tensor([0.7813])   (r == rnds).nonzero(): tensor([[10]])
i: 6   r: tensor([0.6550])   (r == rnds).nonzero(): tensor([[12]])
i: 7   r: tensor([0.1823])   (r == rnds).nonzero(): tensor([[14]])
i: 8   r: tensor([0.2432])   (r == rnds).nonzero(): tensor([[16]])
nBig: 1000000   time = 4.6825549602508545
mlt_multi_float:     [43, 58, 33, 26, 46, 25, 23, 45]
mlt_multi_double:    [43, 58, 33, 26, 46, 25, 23, 45]
mlt_single_float:    [43, 58, 33, 26, 46, 25, 23, 45]
mlt_single_double:   [43, 58, 33, 26, 46, 25, 23, 45]

2.0.0
rnds: tensor([0.4290, 0.7201, 0.9481, 0.4797, 0.5414, 0.9906, 0.4086, 0.2183, 0.1834,
        0.2852, 0.7813, 0.1048, 0.6550, 0.8375, 0.1823, 0.5239, 0.2432, 0.9644,
        0.5034, 0.0320, 0.8316, 0.3807, 0.3539, 0.2114, 0.9839, 0.6632, 0.7001,
        0.0155, 0.3840, 0.7968, 0.4917, 0.4324, 0.5174, 0.6913, 0.1628, 0.5692,
        0.0938, 0.3054, 0.1259, 0.7719, 0.6046, 0.9558, 0.0861, 0.7213, 0.0747,
        0.0035, 0.4003, 0.5900, 0.1179, 0.7419, 0.6117, 0.9598, 0.4614, 0.9241,
        0.1411, 0.9587, 0.9025, 0.5650, 0.2162, 0.9765, 0.6009, 0.5165, 0.3156,
        0.3671])
multi-sample, dtype: torch.float32
i: 1   r: tensor([0.7201])   (r == rnds).nonzero(): tensor([[1]])
i: 2   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 3   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 4   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
i: 5   r: tensor([0.7813])   (r == rnds).nonzero(): tensor([[10]])
i: 6   r: tensor([0.6550])   (r == rnds).nonzero(): tensor([[12]])
i: 7   r: tensor([0.1823])   (r == rnds).nonzero(): tensor([[14]])
i: 8   r: tensor([0.2432])   (r == rnds).nonzero(): tensor([[16]])
nBig: 1000000   time = 0.05341219902038574
single-sample, dtype: torch.float32
i: 1   r: tensor([0.7201])   (r == rnds).nonzero(): tensor([[1]])
i: 2   r: tensor([0.9481])   (r == rnds).nonzero(): tensor([[2]])
i: 3   r: tensor([0.4797])   (r == rnds).nonzero(): tensor([[3]])
i: 4   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 5   r: tensor([0.9906])   (r == rnds).nonzero(): tensor([[5]])
i: 6   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 7   r: tensor([0.2183])   (r == rnds).nonzero(): tensor([[7]])
i: 8   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
nBig: 1000000   time = 45.47208642959595
multi-sample, dtype: torch.float64
i: 1   r: tensor([0.9481])   (r == rnds).nonzero(): tensor([[2]])
i: 2   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 3   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 4   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
i: 5   r: tensor([0.7813])   (r == rnds).nonzero(): tensor([[10]])
i: 6   r: tensor([0.6550])   (r == rnds).nonzero(): tensor([[12]])
i: 7   r: tensor([0.1823])   (r == rnds).nonzero(): tensor([[14]])
i: 8   r: tensor([0.2432])   (r == rnds).nonzero(): tensor([[16]])
nBig: 1000000   time = 0.046894073486328125
single-sample, dtype: torch.float64
i: 1   r: tensor([0.9481])   (r == rnds).nonzero(): tensor([[2]])
i: 2   r: tensor([0.5414])   (r == rnds).nonzero(): tensor([[4]])
i: 3   r: tensor([0.4086])   (r == rnds).nonzero(): tensor([[6]])
i: 4   r: tensor([0.1834])   (r == rnds).nonzero(): tensor([[8]])
i: 5   r: tensor([0.7813])   (r == rnds).nonzero(): tensor([[10]])
i: 6   r: tensor([0.6550])   (r == rnds).nonzero(): tensor([[12]])
i: 7   r: tensor([0.1823])   (r == rnds).nonzero(): tensor([[14]])
i: 8   r: tensor([0.2432])   (r == rnds).nonzero(): tensor([[16]])
nBig: 1000000   time = 42.14270281791687
mlt_multi_float:     [43, 58, 33, 26, 46, 25, 23, 45]
mlt_multi_double:    [43, 58, 33, 26, 46, 25, 23, 45]
mlt_single_float:    [49, 23, 38, 85, 5, 3, 71, 25]
mlt_single_double:   [23, 85, 3, 25, 28, 9, 39, 60]

Best.

K. Frank