Relationship between loss and gradient?

Hello, I’m still fairly new to PyTorch and trying to understand autograd. I’m using L1 loss to keep things simple, and the loss value comes out as expected, but the gradient does not. In my old framework, using L1 loss meant the initial gradients were just output - targets. PyTorch, however, is giving fixed values of 0.25 * direction:

grad_output during the backward hook of the top layer (2 nodes, batch size of 2):

(Variable containing:
(0 ,0 ,.,.) = 
  0.2500

(0 ,1 ,.,.) = 
 -0.2500

(1 ,0 ,.,.) = 
  0.2500

(1 ,1 ,.,.) = 
 -0.2500
[torch.cuda.FloatTensor of size 2x2x1x1 (GPU 0)]
,)

Output:

Variable containing:
 0.5952  0.3170
 0.5666  0.6178
[torch.cuda.FloatTensor of size 2x2 (GPU 0)]

Targets:

(Pdb) targets
Variable containing:
 0  1
 0  1
[torch.cuda.LongTensor of size 2x2 (GPU 0)]

Loss is as expected:

(Pdb) (output-targets).abs().mean()
Variable containing:
 0.5567
[torch.cuda.FloatTensor of size 1 (GPU 0)]

(Pdb) loss
Variable containing:
 0.5567
[torch.cuda.FloatTensor of size 1 (GPU 0)]

Initial gradients equal to output - targets correspond to 0.5 * MSELoss, not to L1Loss.

If you differentiate MSELoss by hand you get 2 * (output - targets) (ignoring the averaging for the moment).

If you differentiate L1Loss (= abs(output - targets)) with respect to output, you get either +1 or -1 depending on whether output - targets is positive or negative. The reason you see 0.25 rather than ±1 in the actual gradients is that PyTorch averages the loss over all elements rather than summing it; your output has 2 × 2 = 4 elements, so each gradient entry becomes ±1/4 = ±0.25.
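
A quick way to check both derivatives, sketched with the current tensor API rather than Variable (the values are arbitrary, not the ones from your post):

import torch
import torch.nn.functional as F

# Toy values, roughly matching the shapes in the post (batch of 2, 2 outputs).
output = torch.tensor([[0.6, 0.3],
                       [0.6, 0.6]], requires_grad=True)
targets = torch.tensor([[0., 1.],
                        [0., 1.]])

# L1Loss with the default 'mean' reduction:
# d(loss)/d(output) = sign(output - targets) / N, and N = 4 elements here -> +/-0.25
F.l1_loss(output, targets).backward()
print(output.grad)

output.grad.zero_()

# MSELoss with the default 'mean' reduction:
# d(loss)/d(output) = 2 * (output - targets) / N
F.mse_loss(output, targets).backward()
print(output.grad)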


Ah, I see. I guess I didn’t understand L1 loss. MSE loss seems to be what I was looking for, and I also didn’t know you had to scale by the batch size. Thanks!

All the “code a simple neural net” articles that I have seen to date just use output - targets as the initial gradients and don’t explain the correspondence to MSE loss.

Averaging over the batch size avoids having to adjust the learning rate to the batch size. Otherwise big batch -> big sum(loss) -> big update, and big infrequent updates aren’t always a great idea.
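
A toy sketch of that scaling, with a single shared weight (made-up numbers, current tensor API):

import torch
import torch.nn.functional as F

# One shared weight w used by every sample in the batch.
for batch_size in (2, 8):
    x = torch.ones(batch_size, 1)            # constant inputs
    targets = torch.zeros(batch_size, 1)

    for reduction in ('sum', 'mean'):
        w = torch.tensor([[0.5]], requires_grad=True)
        output = x @ w                        # shape (batch_size, 1)
        F.mse_loss(output, targets, reduction=reduction).backward()
        # 'sum':  w.grad = 2 * 0.5 * batch_size -> grows with the batch
        # 'mean': w.grad = 2 * 0.5 = 1.0        -> independent of the batch
        print(batch_size, reduction, w.grad.item())

With reduction='sum' you would have to shrink the learning rate as the batch grows; with the default 'mean' the same learning rate works across batch sizes.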