Hello, hoping for a little insight. I’m writing a loss function that calculates, among other things, the entropy of images. I’m using torch.bincount to build a grey-level co-occurrence matrix. The function is nearly 1000x slower on CUDA when the input tensor contains a large number of zeros. Here’s some toy code that minimally reproduces the problem:
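For context, here is a minimal sketch of the kind of bincount-based GLCM computation I mean (a toy version, not my actual loss code; the image size and number of grey levels are arbitrary):

```python
import torch

# Toy grey-level co-occurrence matrix (GLCM) via bincount.
# Hypothetical example: 8 grey levels, horizontal neighbour pairs only.
levels = 8
img = torch.randint(0, levels, (64, 64))

# Encode each (left, right) neighbour pair as a single index in [0, levels^2).
pairs = img[:, :-1] * levels + img[:, 1:]
glcm = torch.bincount(pairs.flatten(), minlength=levels * levels).reshape(levels, levels)

# Normalise to a joint probability and take the Shannon entropy.
p = glcm.float() / glcm.sum()
entropy = -(p[p > 0] * p[p > 0].log()).sum()
```

In the real loss the input images are quantised to ~6500 levels, which is why the tensors below have values up to 6499.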

```
import time

import numpy as np
import torch

a = torch.randint(0, 6500, (100000,))
b = torch.zeros(100000).long()
for x in range(2700):
    idx = np.random.choice(100000)
    num = np.random.choice(6500)
    b[idx] = num

print('a unique values', len(torch.unique(a)))
print('b unique values', len(torch.unique(b)))
print('a max', a.max())
print('b max', b.max())
print('a # of zeros: ', (a == 0).sum())
print('b # of zeros: ', (b == 0).sum())

print('CPU Bincount:')
for i, x in enumerate([a, b]):
    time1 = time.time()
    torch.bincount(x)
    time2 = time.time()
    print('Tensor {}, bincount time: {:4f} s'.format(i, time2 - time1))

print('Cuda Bincount:')
for i, x in enumerate([a, b]):
    x = x.cuda()
    torch.cuda.synchronize()
    time1 = time.time()
    torch.bincount(x)
    torch.cuda.synchronize()
    time2 = time.time()
    print('Tensor {}, bincount time: {:4f} s'.format(i, time2 - time1))
```

Output:

```
a unique values 6500
b unique values 2198
a max tensor(6499)
b max tensor(6499)
a # of zeros:  tensor(19)
b # of zeros:  tensor(97343)
CPU Bincount:
Tensor 0, bincount time: 0.002553 s
Tensor 1, bincount time: 0.002548 s
Cuda Bincount:
Tensor 0, bincount time: 0.001399 s
Tensor 1, bincount time: 1.475042 s
```

Can anyone shed some light on this? Does it have anything to do with the “non-deterministic” behavior of bincount() on CUDA?
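For what it’s worth, one workaround I’m considering is computing the same histogram with scatter_add_ instead of bincount. I haven’t verified that this avoids the slowdown — if the cost comes from atomic-add contention on the hot bin (the zero bin here), scatter_add_ may hit the same problem, and it is also flagged as non-deterministic on CUDA. Sketch:

```python
import torch

# Workaround sketch: bincount-equivalent via scatter_add_.
# bincount_scatter is my own helper name, not a torch API.
def bincount_scatter(x: torch.Tensor, minlength: int = 0) -> torch.Tensor:
    size = max(minlength, int(x.max()) + 1)
    counts = torch.zeros(size, dtype=torch.long, device=x.device)
    # Add 1 to counts[v] for every value v in x.
    return counts.scatter_add_(0, x, torch.ones_like(x))

# Same shape of data as tensor b above: mostly zeros, a few random values.
b = torch.zeros(100000, dtype=torch.long)
b[torch.randint(0, 100000, (2700,))] = torch.randint(0, 6500, (2700,))
assert torch.equal(bincount_scatter(b, 6500), torch.bincount(b, minlength=6500))
```

Whether this is actually faster on CUDA for zero-heavy inputs is exactly what I’d need to benchmark.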