from __future__ import print_function
from torch.autograd import Variable
import torch
xx = Variable(torch.Tensor([2]), requires_grad = True)
yy = 3*xx
zz = yy**2
yy.register_hook(print)
zz.register_hook(print)
tar = Variable(torch.Tensor([100]), requires_grad = False)
loss = zz-tar
loss.backward()
# normally optimizer.step() would follow here

Recently I realized that I don't have a clear understanding of what back propagation and optimizer.step() actually do.

So I tried the code above. What confuses me is that no matter what the target is, the grads of yy and zz keep the same values. Each of these targets:

tar = Variable(torch.Tensor([100]), requires_grad = False)
tar = Variable(torch.Tensor([1000000]), requires_grad = False)
tar = Variable(torch.Tensor([1000000000000]), requires_grad = False)
gives
tensor([ 1.]) for zz.grad
tensor([ 12.]) for yy.grad

But the grad of yy does change when xx changes. With

xx = Variable(torch.Tensor([4]), requires_grad = True)
I get
tensor([ 1.]) for zz.grad
tensor([ 24.]) for yy.grad

So I am confused: how is it possible that the grad of a computation graph node has no relation to the target? From my perspective, if the loss is larger, the grad should become larger as well, so that optimizer.step() makes a bigger update.

Am I missing something basic about grad back propagation?

Does an optimizer such as torch.optim.SGD() add a connection between the loss and the grad?

The derivative of your loss function with respect to the parameters is independent of tar, because tar only enters the loss as an additive constant.
If you use a loss function like loss = (zz-tar)**2, you'll see the behavior you are expecting.
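
To make this concrete, here is a minimal sketch in plain Python (no PyTorch needed) that applies the chain rule by hand to the graph from the question. The function names grads_linear_loss and grads_squared_loss are made up for this illustration and are not part of any library. It only assumes yy = 3*xx and zz = yy**2 as in the posted code.

```python
# Hand-written chain rule for the graph in the question:
#   yy = 3 * xx,  zz = yy ** 2
# Function names are hypothetical, chosen only for this sketch.


def grads_linear_loss(xx, tar):
    """Gradients of loss = zz - tar w.r.t. zz, yy and xx."""
    yy = 3.0 * xx
    d_zz = 1.0              # d(zz - tar)/d(zz): tar is a constant, so it vanishes
    d_yy = d_zz * 2.0 * yy  # chain rule with d(zz)/d(yy) = 2 * yy
    d_xx = d_yy * 3.0       # chain rule with d(yy)/d(xx) = 3
    return d_zz, d_yy, d_xx


def grads_squared_loss(xx, tar):
    """Gradients of loss = (zz - tar) ** 2 w.r.t. zz, yy and xx."""
    yy = 3.0 * xx
    zz = yy ** 2
    d_zz = 2.0 * (zz - tar)  # the target now appears in the derivative
    d_yy = d_zz * 2.0 * yy
    d_xx = d_yy * 3.0
    return d_zz, d_yy, d_xx


for tar in (100.0, 1000000.0):
    print(tar, grads_linear_loss(2.0, tar), grads_squared_loss(2.0, tar))
```

For a fixed xx the first function returns the same tuple for every target, while the second one scales with zz - tar and changes sign when zz passes tar.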

PS: I'm not a huge fan of tagging people, as this might demotivate others to post a solution.

The solution to the MAE criterion gives you the median, while the solution to the MSE criterion gives you the mean. In the former case you'll get a solution that is robust to outliers, but the criterion might have multiple minima. So even though your derivative is piecewise constant, it's still a valid approach, at least for some problems. Note that you'll still get different signs for predictions smaller or greater than your median. I have to admit I've never used the MAE criterion for a "real" problem, so I'm not sure how it'll perform using a deep learning model.
Hastie et al. describe both criteria far better than I can in Elements of Statistical Learning, pages 18-20.
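
As a rough illustration of the median/mean claim, the following sketch fits a single constant to a small made-up sample by brute-force search over a grid of candidates. The data values, the grid, and the helper names mae, mse and mae_subgradient are all assumptions for this example. It also evaluates the MAE (sub)gradient to show the sign flip around the median.

```python
import statistics

# Made-up sample with one large outlier.
data = [1.0, 2.0, 3.0, 4.0, 100.0]


def mae(c):
    """Mean absolute error of the constant prediction c."""
    return sum(abs(v - c) for v in data) / len(data)


def mse(c):
    """Mean squared error of the constant prediction c."""
    return sum((v - c) ** 2 for v in data) / len(data)


def mae_subgradient(c):
    """Derivative of mae w.r.t. c: the mean of sign(c - v), taking sign(0) = 0."""
    signs = [(c > v) - (c < v) for v in data]
    return sum(signs) / len(data)


# Brute-force search over the candidates 0.0, 0.1, ..., 100.0.
candidates = [i / 10 for i in range(0, 1001)]
best_mae = min(candidates, key=mae)
best_mse = min(candidates, key=mse)

print('MAE minimizer:', best_mae, 'median:', statistics.median(data))
print('MSE minimizer:', best_mse, 'mean:', statistics.mean(data))
print('MAE gradient below / above the median:',
      mae_subgradient(0.0), mae_subgradient(50.0))
```

The outlier pulls the MSE solution far away from the bulk of the data, while the MAE solution stays put. The MAE gradient only takes a handful of distinct values, but its sign still points toward the median.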

In the case of cross entropy loss, the derivative with respect to the logit at position i would be prob_i - target_i, where prob_i is the softmax output at position i and target_i is the (one-hot) target value at position i.
Here is a small manual example:

import torch
import torch.nn.functional as F

batch_size = 2
nb_classes = 5
logits = torch.randn(batch_size, nb_classes, requires_grad=True)
target = torch.empty(batch_size, dtype=torch.long).random_(nb_classes)

# Manual gradient: softmax(logits) - one_hot(target), averaged over the batch
# (detached so the in-place update below is not tracked by autograd)
grad = F.softmax(logits, dim=1).detach()
grad[torch.arange(batch_size), target] -= 1
grad = grad / batch_size

# Compare with PyTorch implementation
ce = F.cross_entropy(logits, target)
ce.backward()
print('Manual grad:\n{}\nPyTorch grad:\n{}'.format(grad, logits.grad))
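
If you want to double-check the prob_i - target_i formula without relying on autograd, a finite-difference test works too. This is only a sketch for a single sample in plain Python. The logits, the target index, and the helper names softmax and cross_entropy are made up for the example.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target):
    """Negative log-probability of the target class."""
    return -math.log(softmax(logits)[target])


logits = [0.5, -1.2, 2.0, 0.1]  # made-up values
target = 2

# Analytic gradient: softmax output minus the one-hot target.
analytic = [p - (1.0 if i == target else 0.0)
            for i, p in enumerate(softmax(logits))]

# Central finite differences, one logit at a time.
eps = 1e-6
numeric = []
for i in range(len(logits)):
    up = list(logits)
    down = list(logits)
    up[i] += eps
    down[i] -= eps
    numeric.append(
        (cross_entropy(up, target) - cross_entropy(down, target)) / (2 * eps))

max_diff = max(abs(a - n) for a, n in zip(analytic, numeric))
print('analytic:', analytic)
print('numeric: ', numeric)
print('max abs difference:', max_diff)
```

The two gradients should agree up to the finite-difference error, and the analytic entries sum to zero because the softmax outputs sum to one.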