Computing the gradient of a value w.r.t. a vector: PyTorch 0.4.0 vs 1.7

I am trying to adapt code that was written for PyTorch 0.4.0 to a later version, torch 1.7.1.
The old code had:

        grad_target = (output_cl * label).sum()
        grad_target.backward(gradient=label * output_cl, retain_graph=True)

But the new PyTorch version complains that grad_target has shape [] (a scalar) while label * output_cl is a vector.

Here is the full traceback:

        Traceback (most recent call last):
          File "C:\Program Files\JetBrains\PyCharm 2021.1\plugins\python\helpers\pydev\", line 1483, in _exec
            pydev_imports.execfile(file, globals, locals)  # execute the script
          File "C:\Program Files\JetBrains\PyCharm 2021.1\plugins\python\helpers\", line 18, in execfile
            exec(compile(contents+"\n", file, 'exec'), glob, loc)
          File "C:/Users/Student1/PycharmProjects/new/", line 42, in <module>
          File "C:\Users\Student1\PycharmProjects\new\", line 99, in train_handler
          File "C:\Users\Student1\PycharmProjects\new\", line 369, in train
          File "C:\Users\Student1\PycharmProjects\new\", line 317, in forward
            return self._forward(data, label, extra_super, am_mask)
          File "C:\Users\Student1\PycharmProjects\new\", line 531, in forward
            output_cl, loss_cl, gcams = self.attention_map_forward(data, labels)
          File "C:\Users\Student1\PycharmProjects\new\", line 489, in attention_map_forward
            grad_target.backward(gradient=label * output_cl, retain_graph=True)
          File "C:\Users\Student1\anaconda3\envs\new\lib\site-packages\torch\", line 221, in backward
            torch.autograd.backward(self, gradient, retain_graph, create_graph)
          File "C:\Users\Student1\anaconda3\envs\new\lib\site-packages\torch\autograd\__init__.py", line 126, in backward
            = make_grads(tensors, grad_tensors)
          File "C:\Users\Student1\anaconda3\envs\new\lib\site-packages\torch\autograd\__init__.py", line 37, in _make_grads
            + str(out.shape) + ".")
        RuntimeError: Mismatch in shape: grad_output[0] has a shape of torch.Size([1, 20]) and output[0] has a shape of torch.Size([]).

How can I change it while preserving the original logic?

I wonder if I should just remove the gradient argument.

Thank you.

The problem is that grad_target contains a single scalar value (that’s what the size [] means), but label * output_cl is a whole vector.
The gradient argument must have the same shape as the tensor you call backward() on, so here it must be a single scalar value.
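
A minimal sketch that reproduces the mismatch on a recent PyTorch version. The tensors here are hypothetical stand-ins (random values, shapes taken from the error message), not the model outputs from the question:

```python
import torch

# Hypothetical stand-ins for the tensors in the question.
output_cl = torch.randn(1, 20, requires_grad=True)
label = torch.zeros(1, 20)
label[0, 3] = 1.0

grad_target = (output_cl * label).sum()   # scalar, shape torch.Size([])

try:
    # The gradient has shape [1, 20], the output has shape [], so this is rejected.
    grad_target.backward(gradient=label * output_cl, retain_graph=True)
    raised = False
except RuntimeError as err:
    raised = True
    print(err)

print(grad_target.shape, (label * output_cl).shape)
```

The shape check happens before any gradient is computed, so output_cl.grad is still None after the failed call.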


Yes, I had already understood that from other similar issues. It worked this way with pytorch==0.4.0; I guess the meaning was “differentiate the result w.r.t. each element of the vector”, so removing the argument should do the job, don’t you think, @albanD?

At least after trying it, I see it gives me the same results as the old version, which I did manage to run on Google Colab.

Yes, you should be able to remove that gradient argument.
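
As a rough check, assuming the same hypothetical stand-in tensors as in the question (random output_cl, one-hot label), calling backward() without the argument on the scalar gives the same result as passing an explicit scalar 1:

```python
import torch

# Hypothetical stand-ins for the tensors in the question.
output_cl = torch.randn(1, 20, requires_grad=True)
label = torch.zeros(1, 20)
label[0, 3] = 1.0

# No gradient argument: a scalar output is seeded with an implicit 1.
(output_cl * label).sum().backward()
grad_default = output_cl.grad.clone()

# Same computation, seeded with an explicit scalar 1.
output_cl.grad = None  # reset the accumulated gradient
grad_target = (output_cl * label).sum()
grad_target.backward(gradient=torch.tensor(1.0))
grad_explicit = output_cl.grad.clone()

print(torch.equal(grad_default, grad_explicit))
```

In both cases the gradient w.r.t. output_cl is simply label, since d(sum(output_cl * label))/d(output_cl) = label.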


@albanD, just a final clarification: will it also be equivalent to write:

        grad_target = (output_cl * label)
        grad_target.backward(gradient=label * output_cl, retain_graph=True)

As far as I understand, calling backward on the sum of output_cl * label, with itself as the gradient, is equivalent to calling backward on each element-wise product, i.e. minimizing each product element-wise, which is minimal iff the sum is minimal. Am I wrong?
Is it equivalent? If not, what does this derivative mean compared to the first one?
I appreciate your answer :)

That will be equivalent to the sum if you do grad_target.backward(gradient=torch.ones_like(grad_target)).
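
A small sketch of that equivalence, using hypothetical random tensors in place of the real output_cl and label:

```python
import torch

# Hypothetical stand-ins for the tensors in the question.
output_cl = torch.randn(1, 20, requires_grad=True)
label = torch.randint(0, 2, (1, 20)).float()

# Variant 1: reduce to a scalar first, then use the default gradient.
(output_cl * label).sum().backward()
grad_sum = output_cl.grad.clone()

# Variant 2: keep the vector and seed backward with ones of the same shape.
output_cl.grad = None  # reset the accumulated gradient
grad_target = output_cl * label
grad_target.backward(gradient=torch.ones_like(grad_target))
grad_ones = output_cl.grad.clone()

print(torch.equal(grad_sum, grad_ones))
```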


@albanD, now that I think about it, what does the vector of ones mean?
Suppose the true labels are [0 0 1 0 0 1],
and by multiplying output_cl * label I get [0 0.5 0 0.5 0.5 0.5].

So, given the multiclass labels and the optimization problem, the differentiation should be:
[0 0.5 0 0.5 0.5 0.5].backward(gradient=[0 0 1 0 0 1]), shouldn’t it?
Because I want 1’s only at the true-label positions (suppose it is a one-hot encoding), and not [1 1 1 1 1 1].


The gradient argument here is not the gradient of the op you just did, but the gradient flowing back from the later functions.
If you want to see it differently, backward computes gradient^T J, where J is the Jacobian of your function, i.e. the matrix that contains its partial derivatives.
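
To make the gradient^T J reading concrete, here is a sketch with a hypothetical linear function y = W @ x, chosen because its Jacobian is exactly W. The names W, x and v are illustrative only:

```python
import torch

# Hypothetical linear function y = W @ x; its Jacobian dy/dx is W itself.
W = torch.tensor([[1.0, 2.0, 3.0],
                  [4.0, 5.0, 6.0]])
x = torch.tensor([1.0, 1.0, 1.0], requires_grad=True)
y = W @ x                      # shape [2]

# v plays the role of the gradient flowing back into y from later functions.
v = torch.tensor([10.0, 1.0])
y.backward(gradient=v)

# x.grad now holds the vector-Jacobian product v^T W.
expected = v @ W
print(x.grad, expected)
```

Here v^T W = [10*1 + 1*4, 10*2 + 1*5, 10*3 + 1*6] = [14, 25, 36], which is what ends up in x.grad.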


I see. So for the first backward call, the gradient argument is USUALLY just one(s).

@albanD, correct me if I’m wrong: the gradient argument is useful when, for example, I have computed a gradient manually and now want to continue the gradient calculation from there. It is a kind of manual hook into backpropagation, but usually the desired behavior is to pass a value/vector/matrix of ones?

Other scenarios where this argument is needed, and examples of its use, would be great to look at to understand it better. Thank you.


In 99.9% of the cases, the output loss is a scalar, so what you want here is a single 1 (which is the default value), as this will compute the full gradient of your function.
If there is more than one output, there is no natural value to set, so it has to be provided by the user.

Then the value you set depends on what you want to do; there is no “usual” value in this case.
I said a vector full of 1s because that corresponds to the sum() you were asking about. But that’s it.
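
A short sketch of both points, using a hypothetical three-element function y = x ** 2: a non-scalar output needs an explicit gradient, and the result depends on the value you pass.

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
y = x ** 2                     # three outputs, so no natural default seed

try:
    y.backward()               # no gradient given for a non-scalar output
    needs_gradient = False
except RuntimeError:
    needs_gradient = True

# Seeding with ones corresponds to differentiating y.sum(): 2 * x.
y = x ** 2
y.backward(gradient=torch.ones_like(y))
grad_all = x.grad.clone()

# Seeding with a one-hot vector differentiates only the last output.
x.grad = None  # reset the accumulated gradient
y = x ** 2
y.backward(gradient=torch.tensor([0.0, 0.0, 1.0]))
grad_last = x.grad.clone()

print(needs_gradient, grad_all, grad_last)
```

With ones the result is [2, 4, 6]; with the one-hot seed it is [0, 0, 6].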

There are very few cases where this arg is actually needed.
The main one I know of is backpropagation through time, where you want to stage the backprop of each step independently so that you can run just the right number of them.
In that case, you will need to manually chain the backward calls and you can use this argument for that.
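
A minimal sketch of chaining two backward calls by hand. This is not a full backpropagation-through-time setup, only an illustration of the mechanism, and the two-stage split and the names h and h_cut are assumptions made for the example:

```python
import torch

x = torch.tensor([1.0, 2.0], requires_grad=True)

# Stage 1: compute an intermediate result, then cut the graph at that point.
h = x * 3.0
h_cut = h.detach().requires_grad_()

# Stage 2: continue from the detached copy down to a scalar loss.
loss = (h_cut ** 2).sum()

# Backward through stage 2 only; this fills h_cut.grad with dloss/dh.
loss.backward()

# Backward through stage 1, seeded with the gradient that came out of stage 2.
h.backward(gradient=h_cut.grad)

print(x.grad)
```

The result matches the end-to-end gradient of sum((3x)^2), which is 18 * x, i.e. [18, 36] here.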


Excellent, thank you very much!