RuntimeError in backward pass (no backward override)

Carl · August 31, 2017, 5:18pm

Hello!

I’ve stumbled on a RuntimeError during the backward pass of my model. It looks like a bug in autograd, since the forward pass works alright and I did not override any .backward(). Here are the relevant bits of the stack trace:

Traceback (most recent call last):
...
  File "/home/carl/projets/smallnet/trainable.py", line 182, in train_epoch
    loss.backward()
  File "/home/carl/vpriv/lib/python3.5/site-packages/torch/autograd/variable.py", line 156, in backward
    torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
  File "/home/carl/vpriv/lib/python3.5/site-packages/torch/autograd/__init__.py", line 98, in backward
    variables, grad_variables, retain_graph)
RuntimeError: invalid argument 3: sizes do not match at /pytorch/torch/lib/THC/generated/../generic/THCTensorMathPointwise.cu:217

Are you aware of common pitfalls that can trigger that kind of error?
It seems I cannot debug the backward pass; it’s computed with the C library.

Thanks a lot!

SpandanMadan · August 31, 2017, 11:12pm

More context would be helpful. An error log without the code is not enough to debug this. Can you paste the code?

albanD · September 1, 2017, 8:32am

Hi,
Two things:

1. What version of pytorch are you using?
2. This kind of error during the backward pass usually means that you modified a Tensor in place during the forward pass, and that this tensor was needed for backpropagation. As stated by @SpandanMadan, we need more context to help you in this case.

Carl · September 1, 2017, 5:50pm

Thank you both for your answers.

My code is spread out over numerous files, so I could not include it; it was also hard to build a minimal reproducible example.

I’m using version 0.2.0.

Thank you for suggesting this. It helped me track down the following operation. Here, I just want to modify a parameter inside a module:

self.my_param.data = other_tensor  # where my_param is a nn.Parameter
# other_tensor has a different size than self.my_param.data

Then I do a forward and a backward pass, without any problem. Then, I modify the parameter again. I do a forward pass, then a backward pass, and there I get the RuntimeError. The reason is that .grad is allocated on the first backward pass, with the size .data had at that time. When I change .data the second time, its size no longer matches the size of .grad.

Thus the error: RuntimeError: invalid argument 3: sizes do not match
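The mismatch described above can be made visible by comparing shapes. This is only a minimal sketch, assuming a recent PyTorch release rather than 0.2.0; the standalone `my_param` and the sizes 3 and 5 are made up for illustration. It stops before the second backward pass, because what happens there may differ between versions.

```python
import torch
import torch.nn as nn

# Hypothetical stand-in for the parameter discussed above.
my_param = nn.Parameter(torch.ones(3))

# First forward/backward pass: .grad is allocated with the shape of .data.
(my_param * 2.0).sum().backward()
shapes_before = (tuple(my_param.data.shape), tuple(my_param.grad.shape))

# Assign a tensor of a different size to .data; .grad is left untouched.
my_param.data = torch.ones(5)
shapes_after = (tuple(my_param.data.shape), tuple(my_param.grad.shape))

print("before:", shapes_before)
print("after: ", shapes_after)
```

Once the two shapes in `shapes_after` disagree, the next backward pass has to accumulate a new gradient into a buffer of the old size.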

My solution to this was to replace the parameter completely instead of assigning to .data:

self.my_param = nn.Parameter(other_tensor)
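A slightly fuller sketch of this replacement, assuming a recent PyTorch release; `SmallModule`, the sizes, and the SGD optimizer are hypothetical and not taken from my actual code. One caveat it tries to show: an optimizer created before the replacement still holds the old Parameter object, so it has to be rebuilt (or have its parameter groups updated).

```python
import torch
import torch.nn as nn

class SmallModule(nn.Module):  # hypothetical module, for illustration only
    def __init__(self):
        super().__init__()
        self.my_param = nn.Parameter(torch.ones(3))

    def forward(self, x):
        return (x * self.my_param).sum()

module = SmallModule()
optimizer = torch.optim.SGD(module.parameters(), lr=0.1)
module(torch.randn(3)).backward()

# Replace the parameter instead of assigning to .data.
# Assigning an nn.Parameter to a module attribute registers it under that name.
module.my_param = nn.Parameter(torch.ones(5))
grad_after_replace = module.my_param.grad  # the new Parameter has no gradient yet

# The optimizer was built from the old Parameter object, so rebuild it.
optimizer_is_stale = optimizer.param_groups[0]["params"][0] is not module.my_param
optimizer = torch.optim.SGD(module.parameters(), lr=0.1)

# The next backward pass allocates a gradient with the new size.
module(torch.randn(5)).backward()
grad_shape = tuple(module.my_param.grad.shape)
```

Since the new Parameter starts without a .grad, there is no old buffer left to conflict with.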

Now everything works.

Should this be prevented?

If we want to prevent this from happening, there are many options.

1. Should it be forbidden to assign a new tensor to my_param.data?
2. Should it be forbidden to assign a tensor of a different size to my_param.data?
3. Should my_param.grad be reassigned automatically when the size of .data changes?
4. Should the sizes of all .grad be matched to their .data when calling optimizer.zero_grad()?

Options 2 and 4 make sense to me and do not seem too complicated to implement. But in any case, they would add some overhead.
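For what it's worth, something close to option 4 can be approximated in user code today. The helper below is a rough sketch, assuming a recent PyTorch release; `drop_stale_grads` is a name I made up, not an existing PyTorch function, and the `nn.Linear` layer is only there to have parameters to work with.

```python
import torch
import torch.nn as nn

def drop_stale_grads(module):
    """Set .grad to None wherever its shape no longer matches .data.

    Hypothetical helper (not part of PyTorch). Returns the affected names.
    """
    dropped = []
    for name, param in module.named_parameters():
        if param.grad is not None and param.grad.shape != param.data.shape:
            param.grad = None  # a fresh gradient is allocated on the next backward
            dropped.append(name)
    return dropped

layer = nn.Linear(4, 2)
layer(torch.randn(1, 4)).sum().backward()  # allocates .grad for weight and bias

layer.bias.data = torch.zeros(3)  # size change through .data, as in my case
dropped = drop_stale_grads(layer)
print(dropped)
```

Calling it right before optimizer.zero_grad() costs one shape comparison per parameter, which gives a rough idea of the overhead mentioned above.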

Any thoughts?