I have a simple Linear model and I need to calculate the loss for it. I tried both CrossEntropyLoss and NLLLoss, and I want to understand how the grads are calculated for each of them.
The output layer has 4 neurons, which means I am classifying into 4 classes.
L1 = nn.Linear(2,4)
When I use CrossEntropyLoss I get grads for all the parameters:
That is right. But what I actually want to know is how the grads are calculated with softmax and without it.
Let’s suppose the output is [.1, .2, .3, .4] and the target is [3]. Could you tell me how the grads are actually calculated for CrossEntropyLoss and for pure NLLLoss?
It is also fine if you explain it with your own example.
Thanks
The second approach shows the manual calls and you should be able to recompute the backward pass based on the used operations (torch.exp, torch.sum, indexing etc.).
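As a rough sketch of what such manual calls could look like (this is not the code referred to above, just an illustration using the numbers from the question), the loss can be built from torch.exp, torch.sum and indexing, and its gradient compared against the closed form softmax(logits) - one_hot(target). The variable names here are my own.

```python
import torch

# Values from the question: one sample, four classes, target class 3.
logits = torch.tensor([[0.1, 0.2, 0.3, 0.4]], requires_grad=True)
target = torch.tensor([3])

# Manual cross entropy: log_softmax built from exp / sum / log,
# then pluck out the "true"-class log-probability by indexing.
exp = torch.exp(logits)
log_probs = logits - torch.log(torch.sum(exp, dim=1, keepdim=True))
manual_loss = -log_probs[0, target.item()]
manual_loss.backward()
manual_grad = logits.grad.clone()

# Closed form of the same gradient: softmax(logits) - one_hot(target).
probs = torch.softmax(logits.detach(), dim=1)
one_hot = torch.nn.functional.one_hot(target, num_classes=4).float()
expected_grad = probs - one_hot

# Built-in CrossEntropyLoss on a fresh copy of the logits, for comparison.
logits2 = logits.detach().clone().requires_grad_(True)
builtin_loss = torch.nn.CrossEntropyLoss()(logits2, target)
builtin_loss.backward()

print(manual_loss, builtin_loss)
print(manual_grad, expected_grad, logits2.grad)
```

For these inputs the gradient should come out to roughly [0.214, 0.236, 0.261, -0.711]: every class gets a non-zero entry, and only the "true" class is negative.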
Before addressing a possible point of confusion, let me emphasize that you
should not feed the output of a Linear directly into NLLLoss.
The output of a Linear will typically include positive numbers, which, when
interpreted as log-probabilities, correspond to invalid “probabilities” that are
greater than one. If you permit “probabilities” greater than one, NLLLoss
will return negative values, will be unbounded below, and your training will
diverge.
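To make the "unbounded below" point concrete, here is a small sketch. The tensor raw is a made-up stand-in for the output of a Linear, not anything from your model.

```python
import torch

nll = torch.nn.NLLLoss()
target = torch.tensor([1])  # "true" class is 1

# Hypothetical raw Linear outputs; these are NOT valid log-probabilities.
raw = torch.tensor([[0.5, 2.0, -1.0]])

# NLLLoss just returns minus the "true"-class entry, so a positive
# entry gives a negative loss ...
loss_raw = nll(raw, target)

# ... and scaling the outputs up makes the loss as negative as you like.
loss_scaled = nll(10 * raw, target)

# After log_softmax every entry is <= 0, so the loss is bounded below by 0.
loss_valid = nll(torch.log_softmax(raw, dim=1), target)

print(loss_raw, loss_scaled, loss_valid)
```

An optimizer minimizing the first two losses can keep pushing the "true"-class output upward forever, which is the divergence described above.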
This is to be expected (depending on the details of your use case).
NLLLoss plucks out the log-probability of the “true” class (as specified
by the passed-in target), and doesn’t depend on the non-true-class
probabilities. So those gradients are zero.
Consider:
>>> import torch
>>> torch.__version__
'1.13.0'
>>> _ = torch.manual_seed (2022)
>>> log_probs = torch.rand (1, 5, requires_grad = True)
>>> targ = torch.tensor ([2]) # "true" class is 2
>>> loss = torch.nn.NLLLoss() (log_probs, targ)
>>> loss
tensor(-0.7588, grad_fn=<NllLossBackward0>)
>>> loss.backward()
>>> log_probs.grad # only non-zero for "true" class
tensor([[ 0., 0., -1., 0., 0.]])
>>> with torch.no_grad(): # change values for some "non-true" classes
...     log_probs[0, 0] = 66.6
...     log_probs[0, 4] = 99.9
...
>>> torch.nn.NLLLoss() (log_probs, targ) # loss doesn't change
tensor(-0.7588, grad_fn=<NllLossBackward0>)
When you pass the output of Linear through log_softmax() (or softmax(),
for that matter), it mixes the classes together so that the “true”-class value
(that NLLLoss plucks out) depends on all of the outputs of Linear and you
get non-zero gradients for all elements of weight (and bias).
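Here is a sketch of that difference using a Linear (2, 4) like the one in the question. The input x, the seed and the variable names are arbitrary choices of mine, and the first case feeds Linear straight into NLLLoss only to show the gradient pattern (as noted above, you should not train that way).

```python
import torch

torch.manual_seed(0)
lin = torch.nn.Linear(2, 4)
x = torch.tensor([[1.0, 2.0]])   # arbitrary single input sample
target = torch.tensor([3])       # "true" class is 3

# Case 1: NLLLoss directly on the Linear output (for illustration only).
# Only the "true"-class row of weight receives a gradient.
loss = torch.nn.NLLLoss()(lin(x), target)
loss.backward()
raw_weight_grad = lin.weight.grad.clone()

# Case 2: log_softmax in between mixes the classes together,
# so every row of weight receives a non-zero gradient.
lin.zero_grad()
loss = torch.nn.NLLLoss()(torch.log_softmax(lin(x), dim=1), target)
loss.backward()
mixed_weight_grad = lin.weight.grad.clone()

print(raw_weight_grad)
print(mixed_weight_grad)
```

In the first case rows 0 to 2 of the weight gradient are zero and row 3 is just -x. In the second case row i is (softmax_i - one_hot_i) * x, which is non-zero for every class.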