Assigning a tensor to multiple rows on GPU

Hello!

I recently updated from PyTorch 1.9 to 1.11, and the code below started throwing the error shown after it, but only on GPU; on CPU it works fine. The odd part is that the failure happens with boolean indexing, and integer-list indexing throws the same error, while slicing works perfectly. If I reshape the value tensor so that its first dimension matches the number of rows I am assigning to, it works again (the other cases are sketched below the error message). Is there a proper way to assign multiple rows of a tensor without creating a value tensor of exactly matching size? I assume this is supposed to work, since it works on CPU.

x = torch.zeros((5,4), device=torch.device('cuda:0'))
x[[False,True,False,True,True]] = torch.tensor([1.0, 1.0, 1.0, 1.0], device=torch.device('cuda:0'), dtype=torch.float32)
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == expandedValue.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1646755897462/work/aten/src/ATen/native/cuda/Indexing.cu":268, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor: 12 vs 4
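
For reference, here are the other cases I mentioned, sketched with the same x as above (this is just what I observe on my machine):

# integer-list indexing of the same three rows throws the same error
x[[1, 3, 4]] = torch.tensor([1.0, 1.0, 1.0, 1.0], device=torch.device('cuda:0'))
# slicing works fine
x[1:3] = torch.tensor([1.0, 1.0, 1.0, 1.0], device=torch.device('cuda:0'))
# a (3, 4) value tensor matching the three selected rows also works
x[[False, True, False, True, True]] = torch.ones((3, 4), device=torch.device('cuda:0'))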

Hi Ken!

This appears to be a known issue, but specifically in the context of using
“deterministic” cuda algorithms. It looks like some work-arounds are
discussed in the relevant (closed?!) github issue:

I can reproduce your issue, both in version 1.11 and in a recent nightly
(1.13.0.dev20220604), but only if I set:

torch.use_deterministic_algorithms(True)

>>> import torch
>>> torch.__version__
'1.11.0'
>>> torch.version.cuda
'11.3'
>>> torch.cuda.get_device_name()
'GeForce GTX 1050 Ti'
>>> x = torch.zeros((5,4), device=torch.device('cuda:0'))
>>> x[[False,True,False,True,True]] = torch.tensor([1.0, 1.0, 1.0, 1.0], device=torch.device('cuda:0'), dtype=torch.float32)
>>> x
tensor([[0., 0., 0., 0.],
        [1., 1., 1., 1.],
        [0., 0., 0., 0.],
        [1., 1., 1., 1.],
        [1., 1., 1., 1.]], device='cuda:0')
>>> torch.use_deterministic_algorithms(True)
>>> x[[False,True,False,True,True]] = torch.tensor([1.0, 1.0, 1.0, 1.0], device=torch.device('cuda:0'), dtype=torch.float32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
RuntimeError: linearIndex.numel()*sliceSize*nElemBefore == expandedValue.numel()INTERNAL ASSERT FAILED at "/opt/conda/conda-bld/pytorch_1646755903507/work/aten/src/ATen/native/cuda/Indexing.cu":268, please report a bug to PyTorch. number of flattened indices did not match number of elements in the value tensor: 12 vs 4
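
Until this is fixed, a couple of work-arounds along the lines you noted should do the trick: give the value tensor a leading dimension that matches the number of selected rows, or turn deterministic algorithms off just around the assignment. A quick sketch, continuing the session above (I haven't tested this exhaustively):

>>> v = torch.tensor([1.0, 1.0, 1.0, 1.0], device=torch.device('cuda:0'))
>>> # expand v to (3, 4) so it matches the three selected rows
>>> x[[False, True, False, True, True]] = v.expand(3, -1)
>>> # or disable deterministic algorithms just for this assignment
>>> torch.use_deterministic_algorithms(False)
>>> x[[False, True, False, True, True]] = v
>>> torch.use_deterministic_algorithms(True)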

Best.

K. Frank

CC @eqy, could you take a look at this failure, since you worked on the last fix?
This might be a new (previously untested) issue, or the same error popping up again.