Very weird shape-mismatch error between the exact same tensors

Hi. I'm working on some NLP code where I need to create specific masks. However, when I do the operation a[a == 1] = 5, it tells me that a and a == 1 are not the same shape. This happens after the 8th epoch, meaning it worked for 8 epochs but not after. Moreover, it happens at random points: maybe halfway through or at the beginning of the 8th epoch, or sometimes not at all. If I clear the output and run the epoch again, it doesn't happen, but it happens at the following epoch. This is a showcase of my problem:

import torch

batch_size, max_sen_len = 80, 14           # e.g. the shapes from my actual run
original_tensor = torch.randn(batch_size, max_sen_len)
tensor_c = original_tensor.detach()
max_value, _ = torch.max(tensor_c, 1)      # per-row maximum, shape [batch_size]
max_value = max_value.unsqueeze(1)         # shape [batch_size, 1]
mask = torch.ones_like(tensor_c)
mask[tensor_c < max_value] = 0
mask[tensor_c >= max_value] = 1
mask2 = mask.clone()
values_batch = max_value.clone().squeeze(1)   # shape [batch_size]
mask2[mask2 == 1] = 5

When I get the error, it looks like this:
shape mismatch: value tensor of shape [80] cannot be broadcast to indexing result of shape [81]

Even if I place assert mask2.shape == (mask2 == 1).shape right before the assignment, the assertion never fails, as the shapes are equal; yet PyTorch reports a mismatch at that particular point. So I have no clue what's wrong. Can anybody advise on this strange phenomenon?
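
For reference, my understanding of the boolean-mask assignment semantics, sketched with toy values (not my real data):

    import torch

    a = torch.zeros(2, 3)
    a[0, 1] = 1.0
    a[1, 2] = 1.0
    mask = (a == 1)                         # 2 True entries
    a[mask] = torch.tensor([5.0, 7.0])      # OK: 2 values for 2 True entries
    # a[mask] = torch.tensor([5., 6., 7.])  # RuntimeError: value tensor of shape [3]
    #                                       # cannot be broadcast to indexing result of shape [2]

So the value tensor must match the number of True entries, which is why the message confuses me: the mask itself always has the right shape.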

Hi @Fawaz_Sammani,

What’s the PyTorch version? Did you try with another one?

What are the mask2 values when it fails? Is there (at least) a 1 in mask2?

Can you also try with:

mask2 = mask.clone().detach()

Hi @spanev
I'm using the latest version (1.2). I haven't tried another one.
According to the implementation, there should always be a 1, as there is always a maximum value when calling this line: max_value, _ = torch.max(tensor_c, 1). I will try with .detach() for mask2, but it shouldn't matter, as mask2 is a copy of mask, which was created with ones_like from tensor_c, and tensor_c is already detached from the graph.

Actually, 1.3 was released last week; please try with that one, just in case.

Yes, it doesn't really make sense, but it's better to eliminate the weirdest hypotheses first.

Hi,

Could you give a full stack trace for the error please?

Hi @albanD. As I said, it happens randomly and only at particular times. Once it happens again during training, I'll give you the full stack trace of the error. Thanks a lot!

Hi @albanD and @spanev. Here is what I'm getting. Before that, please note that my original_tensor is a masked softmax. For convenience, I'll repost the whole code:

    before_softmax = before_softmax.masked_fill(sentence_mask == 0, -1e10)
    original_tensor = F.softmax(before_softmax, dim=1)
    tensor_c = original_tensor.detach()
    max_value, _ = torch.max(tensor_c, 1)
    max_value = max_value.unsqueeze(1)
    mask = torch.ones_like(tensor_c)
    mask[tensor_c < max_value] = 0
    mask[tensor_c >= max_value] = 1
    mask2 = mask.clone().detach()     # as per spanev's advice

    values_batch = max_value.clone().squeeze(1)
    try:
        mask2[mask2 == 1] = values_batch * 2
    except Exception:
        print("Unknown Error occurred... will be ignored for now")
        print(mask2.shape, (mask2 == 1).shape, values_batch.shape)
        print(traceback.format_exc())
        print(traceback.print_stack())   # print_stack() returns None, hence the stray "None" below

And here is what I get:
Epoch: [0][0/7081] Loss 11.5036 (11.5036) Top-3 Accuracy 0.000 (0.000)
Epoch: [0][100/7081] Loss 3.4485 (4.0602) Top-3 Accuracy 50.215 (45.270)
Epoch: [0][200/7081] Loss 3.1352 (3.6904) Top-3 Accuracy 54.885 (49.896)
Epoch: [0][300/7081] Loss 3.1681 (3.5049) Top-3 Accuracy 56.575 (52.313)
Epoch: [0][400/7081] Loss 3.1377 (3.3968) Top-3 Accuracy 57.095 (53.771)
Epoch: [0][500/7081] Loss 3.2094 (3.3248) Top-3 Accuracy 57.419 (54.772)
Epoch: [0][600/7081] Loss 2.9907 (3.2681) Top-3 Accuracy 58.225 (55.546)

Unknown Error occurred... will be ignored for now
torch.Size([80, 14]) torch.Size([80, 14]) torch.Size([80])
Traceback (most recent call last):
  File "<ipython-input-1-66e4467ea31a>", line 326, in forward
    mask2[mask2 == 1] = values_batch * 2 
RuntimeError: shape mismatch: value tensor of shape [80] cannot be broadcast to indexing result of shape [81]

  File "C:\Users\USER\Anaconda3\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\USER\Anaconda3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\USER\Anaconda3\lib\site-packages\spyder_kernels\console\__main__.py", line 11, in <module>
    start.main()
  File "C:\Users\USER\Anaconda3\lib\site-packages\spyder_kernels\console\start.py", line 318, in main
    kernel.start()
  File "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel\kernelapp.py", line 563, in start
    self.io_loop.start()
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\platform\asyncio.py", line 148, in start
    self.asyncio_loop.run_forever()
  File "C:\Users\USER\Anaconda3\lib\asyncio\base_events.py", line 534, in run_forever
    self._run_once()
  File "C:\Users\USER\Anaconda3\lib\asyncio\base_events.py", line 1771, in _run_once
    handle._run()
  File "C:\Users\USER\Anaconda3\lib\asyncio\events.py", line 88, in _run
    self._context.run(self._callback, *self._args)
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\ioloop.py", line 690, in <lambda>
    lambda f: self._run_callback(functools.partial(callback, future))
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\ioloop.py", line 743, in _run_callback
    ret = callback()
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\gen.py", line 787, in inner
    self.run()
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\gen.py", line 748, in run
    yielded = self.gen.send(value)
  File "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 365, in process_one
    yield gen.maybe_future(dispatch(*args))
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 272, in dispatch_shell
    yield gen.maybe_future(handler(stream, idents, msg))
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel\kernelbase.py", line 542, in execute_request
    user_expressions, allow_stdin,
  File "C:\Users\USER\Anaconda3\lib\site-packages\tornado\gen.py", line 209, in wrapper
    yielded = next(result)
  File "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel\ipkernel.py", line 294, in do_execute
    res = shell.run_cell(code, store_history=store_history, silent=silent)
  File "C:\Users\USER\Anaconda3\lib\site-packages\ipykernel\zmqshell.py", line 536, in run_cell
    return super(ZMQInteractiveShell, self).run_cell(*args, **kwargs)
  File "C:\Users\USER\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2855, in run_cell
    raw_cell, store_history, silent, shell_futures)
  File "C:\Users\USER\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 2881, in _run_cell
    return runner(coro)
  File "C:\Users\USER\Anaconda3\lib\site-packages\IPython\core\async_helpers.py", line 68, in _pseudo_sync_runner
    coro.send(None)
  File "C:\Users\USER\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3058, in run_cell_async
    interactivity=interactivity, compiler=compiler, result=result)
  File "C:\Users\USER\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3249, in run_ast_nodes
    if (await self.run_code(code, result,  async_=asy)):
  File "C:\Users\USER\Anaconda3\lib\site-packages\IPython\core\interactiveshell.py", line 3326, in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  File "<ipython-input-1-66e4467ea31a>", line 754, in <module>
    word_map = word_map)
  File "<ipython-input-1-66e4467ea31a>", line 511, in train
    This is the main class which includes all other classes (the one which is called during training)
  File "C:\Users\USER\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "<ipython-input-1-66e4467ea31a>", line 489, in forward
    This is forward function of the class i'm running
  File "C:\Users\USER\Anaconda3\lib\site-packages\torch\nn\modules\module.py", line 547, in __call__
    result = self.forward(*input, **kwargs)
  File "<ipython-input-1-66e4467ea31a>", line 331, in forward
    print(traceback.print_stack())
None

Epoch: [0][700/7081] Loss 3.2939 (3.2216) Top-3 Accuracy 57.391 (56.224)

And then it continues training. Note that this code is run on Windows (this is my personal PC), but I also ran it yesterday on my other PC (which is Linux) and the same error occurred. So I don't think it's a matter of the OS.

If there is no current solution, is there another function in PyTorch that does the same as mask2[mask2 == 1] = 5? Maybe torch.index_select?
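
For instance, would something like torch.where do the same job? Just a sketch of what I mean, reusing my names from above:

    # scalar replacement, meant to behave like mask2[mask2 == 1] = 5
    mask2 = torch.where(mask2 == 1, torch.full_like(mask2, 5.0), mask2)
    # per-row replacement, roughly like mask2[mask2 == 1] = values_batch * 2;
    # the [batch_size] values are broadcast over the sentence dimension
    mask2 = torch.where(mask2 == 1, (values_batch * 2).unsqueeze(1), mask2)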

Hope I can get your kind help!

You could try making your mask of integral dtype. Floating point and == are not a good idea in general.
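
A classic toy illustration (float32 rounding, nothing specific to your model):

    import torch

    s = torch.tensor(0.0)
    for _ in range(10):
        s = s + torch.tensor(0.1)   # each step rounds to float32
    print(s)                        # tensor(1.0000) -- looks exact when printed...
    print(s == 1.0)                 # tensor(False) -- ...but it is one ulp off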

Best regards

Thomas

Hi @tom. That would work if I had values_batch * 2, but the number I actually have is a float, as in values_batch * 0.5. So if I convert the tensor to torch.uint8, my floating-point values would get truncated to integers. I'll try to figure out some other way to do the indexing. Thanks for your help!
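
For instance:

    x = torch.tensor([0.5, 1.9])
    print(x.to(torch.uint8))   # tensor([0, 1], dtype=torch.uint8) -- the fractions are dropped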

Hi @albanD, @spanev and @tom,
The problem was fixed by converting the PyTorch tensor to NumPy, doing the operation there, and then converting back to a PyTorch tensor, as follows:

    mask2 = mask.clone().cpu().numpy()
    values_batch = max_value.squeeze(1).cpu().numpy()
    mask2[mask2 == 1] = values_batch * 0.5
    mask2 = torch.from_numpy(mask2).to(device)

I hope this issue can be fixed in future PyTorch releases. Thanks!

Would you be able to give us a small code sample (with hardcoded tensor values) that reproduces the problem, please?

@albanD It happened again. NumPy gave the same error!
NumPy boolean array indexing assignment cannot assign 60 input values to the 61 output values where the mask is true

A self-contained example would be very useful for us to help you.

Hi @albanD. I actually figured out the problem: the softmax sometimes produces the same (tied) maximum value more than once in a row. So when I run:

mask[tensor_c < max_value] = 0
mask[tensor_c >= max_value] = 1

I get this:

tensor([[0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 1., 0., 1., 1., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 1., 1., 0., 1., 0., 0.],
        [0., 0., 0., 0., 1., 1., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 1., 1., 0., 1., 1., 1., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 1., 1., 1., 0., 0., 0., 0., 0.]])

Therefore, the number of elements in mask2[mask2 == 1] does not equal the number of elements in values_batch (the batch size, 13 in this example), since some rows contain more than one 1.
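
A toy illustration of the count mismatch (made-up numbers):

    import torch

    tensor_c = torch.tensor([[0.2, 0.4, 0.4],    # tied maximum in row 0
                             [0.1, 0.8, 0.1]])
    max_value, _ = torch.max(tensor_c, 1)
    mask = (tensor_c >= max_value.unsqueeze(1)).float()
    print(int((mask == 1).sum()))   # 3 True entries, but the batch size is only 2,
                                    # so assigning a length-2 value tensor fails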

Is there a clean way to solve this? For example, making mask[tensor_c >= max_value] = 1 insert a 1 only once per row and ignore the repeated maximum values. I assume this happens only at the beginning of training; later on, the softmax should be sharper and produce a single maximum value.

To solve this, I would use the indices returned by the max op and do scatter(your_dim, ind, 1)
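
Something like this (a minimal sketch, assuming your dim is 1):

    import torch

    tensor_c = torch.tensor([[0.2, 0.4, 0.4],    # tied maximum in row 0
                             [0.1, 0.8, 0.1]])
    max_value, max_indices = torch.max(tensor_c, 1)   # one index per row, even with ties
    mask = torch.zeros_like(tensor_c)
    mask.scatter_(1, max_indices.unsqueeze(1), 1.0)   # exactly one 1 per row
    print(mask.sum(1))   # tensor([1., 1.])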

@albanD Thanks! It works perfectly with the scatter function.

Using mask.scatter_(1, max_indices.unsqueeze(1), 1):

tensor([[0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 0., 0., 0., 1., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.],
        [0., 0., 0., 0., 0., 0., 1., 0., 0., 0., 0., 0.]])

So when the same maximum value is repeated, the scatter places a 1 at just one of the tied positions and ignores the rest? Or is there any rule determining where the 1 is placed in that case of repetition?

No, it's just that max returns the index of one of the maximum values (no guarantee which one).
The scatter then just puts a 1 at that index.

Thanks a lot @albanD! I appreciate your help!