A simple but key question about the backpropagation chain rule in PyTorch

from __future__ import print_function
from torch.autograd import Variable
import torch

xx = Variable(torch.Tensor([2]), requires_grad=True)
yy = 3 * xx
zz = yy ** 2

# the hooks print the gradient flowing into yy and zz during backward()
yy.register_hook(print)
zz.register_hook(print)

tar = Variable(torch.Tensor([100]), requires_grad=False)
loss = zz - tar
loss.backward()
# normally optimizer.step() would follow here

Recently, I realized that I never fully understood what backpropagation and optimizer.step() actually do.

So I tried the code above. What confused me is that no matter what the target is, the gradients of yy and zz have the same values:

tar = Variable(torch.Tensor([100]), requires_grad = False)
tar = Variable(torch.Tensor([1000000]), requires_grad = False)
tar = Variable(torch.Tensor([1000000000000]), requires_grad = False)

tensor([ 1.])   for zz.grad
tensor([ 12.])  for yy.grad

But these two do change when xx changes,

xx = Variable(torch.Tensor([4]), requires_grad = True)

and

tensor([ 1.])   for zz.grad
tensor([ 24.])  for yy.grad

So I feel confused:
How is it possible that the gradients of the computation graph nodes have no relation to the target? From my perspective, if the loss is larger, the gradients should become larger as well, so that optimizer.step() would optimize more.

Am I misunderstanding something basic about gradient backpropagation?

Does an optimizer such as torch.optim.SGD() add a connection between the loss and the gradients?

Thanks a lot for reading all above!

The derivative of your loss function with respect to the parameters is independent of tar: with loss = zz - tar, the chain rule gives d(loss)/d(zz) = 1 and d(loss)/d(yy) = 2*yy, and tar appears in neither expression.
If you use a loss function like loss = (zz-tar)**2, you’ll see the behavior you are expecting.
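For example, here is a minimal sketch of that comparison, written with the modern tensor API instead of Variable (the helper name grad_wrt_x is just for illustration):

import torch

def grad_wrt_x(tar_value, squared):
    # build a fresh graph each call so gradients do not accumulate
    xx = torch.tensor([2.0], requires_grad=True)
    zz = (3 * xx) ** 2                  # zz = 36
    loss = (zz - tar_value) ** 2 if squared else zz - tar_value
    loss.backward()
    return xx.grad.item()

for t in (100.0, 1e6):
    # plain difference: d(loss)/d(xx) = 18 * xx = 36, independent of tar
    # squared error:    d(loss)/d(xx) = 2 * (zz - tar) * 18 * xx, depends on tar
    print(t, grad_wrt_x(t, squared=False), grad_wrt_x(t, squared=True))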

PS: I’m not a huge fan of tagging people, as this might discourage others from posting a solution. :wink:

Thanks, I’ll take your suggestion.

But as you can see, my code above is essentially MAE:

MAE = (1/n) * Σ |y_i - ŷ_i|

So if we use MAE, the gradient of each variable has no relation to the final loss value. Why do people still use it? (since the grad of each y_i is constant)
:thinking::thinking::thinking:

And the same holds for cross entropy loss (since the grad with respect to y_i is -tar_i / y_i, which does not contain the loss either).

The solution to the MAE criterion gives you the median, while the solution to the MSE criterion gives you the mean. In the former case you’ll get a solution that is robust to outliers, but it might have multiple minima. So even though your derivative is partially constant, it’s still a valid approach, at least for some problems. Note that you’ll still get different signs for predictions smaller or greater than your median. I have to admit I’ve never used the MAE criterion for a “real” problem, so I’m not sure how it would perform with a deep learning model.
Hastie et al. describe both criteria far better than I possibly can in Elements of Statistical Learning, pages 18-20.
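To make that sign behavior concrete, here is a minimal sketch (using torch.nn.functional.l1_loss and mse_loss) comparing the two gradients for predictions on either side of the target:

import torch
import torch.nn.functional as F

target = torch.tensor([5.0])

for p in (2.0, 4.0, 8.0):
    pred = torch.tensor([p], requires_grad=True)

    # L1 / MAE: gradient is sign(pred - target), magnitude always 1
    F.l1_loss(pred, target).backward()
    l1_grad = pred.grad.clone()

    pred.grad = None
    # MSE: gradient is 2 * (pred - target), it scales with the error
    F.mse_loss(pred, target).backward()
    print(p, l1_grad.item(), pred.grad.item())

The L1 gradient flips its sign at the target but never changes its magnitude, which is exactly the partially constant derivative mentioned above.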

In the case of cross entropy loss, the derivative would be prob_i - target_i, where prob_i is the softmax output at position i and target_i is the one-hot target value at position i.
Here is a small manual example:

import torch
import torch.nn.functional as F

batch_size = 2
nb_classes = 5
logits = torch.randn(batch_size, nb_classes, requires_grad=True)
target = torch.empty(batch_size, dtype=torch.long).random_(nb_classes)

# manual gradient: softmax(logits) minus 1 at the target positions,
# averaged over the batch (F.cross_entropy reduces with the mean)
grad = F.softmax(logits, 1)
grad[torch.arange(batch_size), target] -= 1
grad = grad / batch_size

# Compare with PyTorch implementation
ce = F.cross_entropy(logits, target)
ce.backward()

print('Manual grad:\n{}\nPyTorch grad:\n{}'.format(
    grad, logits.grad))
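As a quick sanity check, the two results should match up to floating point tolerance; something like this should print True:

print(torch.allclose(grad.detach(), logits.grad))  # manual and autograd gradients agree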