torch.bincount() ~1000x slower on CUDA

Thanks for the reply!