Hello, I don’t really get why we need to pass a gradient tensor to tensor.backward(). They say it should be the gradient of the tensor w.r.t. itself, but wouldn’t that just be a tensor of all ones?
If not, could you please give an example where the gradient wouldn’t be all ones?

I feel like I’m missing something basic here. I’ve never done a vector calc course to be fair.
Also I’ve read this and many similar posts, I still don’t understand it…

In the default use case, yes. If your loss is a scalar value, you don’t need to pass the gradient and it will be set to 1 by default.
However, if your loss is a tensor with more than a single value, you have to pass the gradient manually. If you want to use 1s, you could just use loss.backward(torch.ones_like(loss)).
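A minimal sketch of the two cases described above, assuming a toy tensor `x` and a made-up "loss" of `x ** 2` (neither comes from the original question):

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)

# Case 1: scalar loss. backward() implicitly uses a gradient of 1.
loss = (x ** 2).sum()
loss.backward()
grad_scalar = x.grad.clone()  # d(sum(x^2))/dx = 2 * x

# Case 2: non-scalar "loss". The gradient argument must be passed explicitly.
x.grad = None  # reset the accumulated gradient
loss_vec = x ** 2
loss_vec.backward(torch.ones_like(loss_vec))
grad_vector = x.grad.clone()

print(grad_scalar, grad_vector)
```

Passing all ones to the non-scalar case gives the same result as summing the tensor first and calling backward() on the sum.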

I understand the API well enough; what I don’t understand is the mathematical foundation underneath it. I’m saying that I thought the derivative of a tensor w.r.t. itself would HAVE TO BE a tensor of all 1s.

I assumed this because, when you take the gradient, you treat all other variables as constants, so my understanding is that you would get a 1 for each variable and then fill the tensor with those.

Another way of saying it: I don’t see how the gradient could be anything but a tensor of all 1s.

If your use case calls the backward method on the loss and you want the gradient to be dLoss/dLoss, then you don’t need to care about the gradient value and can just use the default of 1.
However, other use cases don’t necessarily call backward only on the loss; they might call it on an intermediate activation, for example, so restricting the gradient to all ones would be an unnecessary limitation.
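Here is a small sketch of such a case, where backward is called on an intermediate activation with a gradient that is not all ones. The names `x`, `w`, `h`, and the toy loss are assumptions made up for illustration:

```python
import torch

x = torch.tensor([1.0, 2.0, 3.0], requires_grad=True)
w = torch.tensor([0.5, -1.0, 2.0])  # fixed weights, no gradient needed

# Toy graph: x -> h (intermediate activation) -> loss (scalar)
h = x ** 2
loss = (w * h).sum()

# Reference: gradient of the scalar loss w.r.t. x through the whole graph.
(grad_full,) = torch.autograd.grad(loss, x, retain_graph=True)

# Same result by calling backward on h directly. The gradient argument
# is dLoss/dh, which equals w here, not a tensor of ones.
h.backward(gradient=w)
grad_via_h = x.grad

print(grad_full, grad_via_h)
```

Here the tensor passed to `h.backward` plays the role of the gradient arriving from the rest of the graph, which is why it can hold arbitrary values.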

But correct me if I’m wrong: wouldn’t that mean that if it were the derivative w.r.t. itself, it would have to be all ones? And it would only be non-unit if it were the derivative w.r.t. something earlier (like the loss or a previous layer)?

If you are trying to calculate dLoss/dLoss, then it would be torch.ones, that’s correct.
For other use cases it might be different. Recently such a use case was described here.
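One way to write down the math behind this, as I understand it (the symbols y, x, J, v, and L are notation introduced here, not taken from the docs): for a tensor y computed from x, calling y.backward(v) computes a vector-Jacobian product rather than a full Jacobian.

```latex
% y = f(x), with x \in \mathbb{R}^n and y \in \mathbb{R}^m
J = \frac{\partial y}{\partial x} \in \mathbb{R}^{m \times n},
\qquad
\texttt{x.grad} = J^{\top} v .

% If a scalar L depends on y, choosing v = \partial L / \partial y
% gives the chain rule:
\frac{\partial L}{\partial x} = J^{\top} \frac{\partial L}{\partial y} .

% Special case y = L (scalar loss): v = \partial L / \partial L = 1 .
```

So the gradient argument is all ones only in the special case where the tensor you call backward on is itself the quantity being differentiated.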

Ah ok, I think the docs for tensor.backward() confused me because they describe the arg as gradient := “Gradient w.r.t. the tensor.” (which makes it sound like the gradient w.r.t. itself).

I am trying to update the discriminator by multiplying the whole computed error by a coefficient. I used this code, but it gives me the following error. I would appreciate your help.

Error:
one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [64]] is at version 1; expected version 0 instead. Hint: enable anomaly detection to find the operation that failed to compute its gradient, with torch.autograd.set_detect_anomaly(True).

Try to enable anomaly detection and check the operation that is shown in the stack trace.
Based on this line of code, check whether in-place operations were executed before it and remove them.
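Since the original code was not posted, here is a hypothetical minimal sketch that triggers the same class of error and shows one possible fix. The sigmoid and the factor 2 are stand-ins chosen for illustration, not the poster's discriminator code:

```python
import torch

# Anomaly detection makes the failing backward also report the forward
# operation whose saved tensor was modified.
torch.autograd.set_detect_anomaly(True)

x = torch.randn(4, requires_grad=True)

# Broken: sigmoid saves its output for the backward pass,
# and the in-place mul_ then overwrites that saved output.
y = torch.sigmoid(x)
y.mul_(2)
try:
    y.sum().backward()
    failed = False
except RuntimeError as err:
    failed = True
    error_message = str(err)

torch.autograd.set_detect_anomaly(False)

# Fixed: the out-of-place multiplication creates a new tensor
# and leaves the saved sigmoid output untouched.
y = torch.sigmoid(x)
y = y * 2
y.sum().backward()
print(failed, x.grad)
```

Anomaly detection slows down execution, so it is usually switched off again once the offending operation has been found.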