Understanding NLLLoss function

pawandeep_singh · August 22, 2018, 6:03pm

import torch
import torch.nn as nn

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
# input is of size N x C = 1 x 3
input = torch.tensor([[0,0,0]], requires_grad=True, dtype=torch.float)
# each element in target has to have 0 <= value < C
target = torch.tensor([1])
output = loss(m(input), target)
print(output)

_Output: tensor(1.0986, grad_fn=<NllLossBackward>)_

I read the documentation, but it’s not clear. Can someone explain the math behind this example?

ptrblck · August 22, 2018, 6:14pm

In your example the output has the same “probability” for all three classes, i.e. the logits have the same value.
Their probabilities should therefore be approx. [0.33, 0.33, 0.33].
Since you are using LogSoftmax, we can check whether this is true by calling exp on it (thus getting rid of the log):

print(m(input))
> tensor([[-1.0986, -1.0986, -1.0986]], grad_fn=<LogSoftmaxBackward>)
print(m(input).exp())
> tensor([[0.3333, 0.3333, 0.3333]], grad_fn=<ExpBackward>)

You will get the same values every time you pass the same logits into LogSoftmax.
Now we just have to pick the log probability at the index given by target, multiply it by -1, and end up with a loss value of 1.0986.
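
The same arithmetic can be reproduced without PyTorch. Here is a minimal sketch in plain Python, where log_softmax is a hand-written helper (an assumption for illustration, not the library's implementation) that mirrors what nn.LogSoftmax computes for one row of logits:

```python
import math

def log_softmax(logits):
    # Hypothetical helper: log(exp(x_i) / sum_j exp(x_j)) for one row of logits.
    log_sum_exp = math.log(sum(math.exp(x) for x in logits))
    return [x - log_sum_exp for x in logits]

logits = [0.0, 0.0, 0.0]   # the single sample from the example
target = 1                 # its class index

log_probs = log_softmax(logits)
loss = -log_probs[target]  # negative log probability of the target class

print([round(lp, 4) for lp in log_probs])  # [-1.0986, -1.0986, -1.0986]
print(round(loss, 4))                      # 1.0986
```

Since all three logits are equal, the loss is simply -ln(1/3) = ln(3), whichever class the target selects.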

Calin_Serban · April 19, 2020, 11:36pm

loss = nn.NLLLoss()
a = torch.tensor(([0.88, 0.12], [0.51, 0.49]), dtype = torch.float)
target = torch.tensor([1, 0])
output = loss(a, target)
print(output)

I don’t know if it’s right to post this question here, but I’ll try: why is the output of this piece of code tensor(-0.3150)? I was expecting it to be (-1/2) * (1 * ln(0.88) + 0 * ln(0.12) + 1 * ln(0.51) + 0 * ln(0.49)), which would be equal to 0.4005, not -0.3150.

I found the formula for log likelihood here

ptrblck · April 20, 2020, 12:20am

nn.NLLLoss expects the inputs to be log probabilities, while you are passing the probabilities into the criterion.

Also, your manual calculation seems to mix up the target indices, as the first sample has class 1 as its target and the second one class 0.

Here is an example showing the same result:

loss = nn.NLLLoss()
a = torch.tensor(([0.88, 0.12], [0.51, 0.49]), dtype = torch.float)
target = torch.tensor([1, 0])
output = loss(torch.log(a), target)
print(output)
> tensor(1.3968)
print((-torch.log(a[0, 1]) - torch.log(a[1, 0])) / 2)
> tensor(1.3968)
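
Both numbers in this exchange can be checked by hand. A plain-Python sketch (no PyTorch; the helper name nll_mean is made up for illustration) that picks the entry at each target index, negates it, and averages over the batch:

```python
import math

def nll_mean(rows, targets):
    # Hypothetical helper: mean over the batch of -rows[i][targets[i]].
    picked = [row[t] for row, t in zip(rows, targets)]
    return -sum(picked) / len(picked)

probs = [[0.88, 0.12], [0.51, 0.49]]
targets = [1, 0]

# Passing raw probabilities reproduces the unexpected negative value.
print(round(nll_mean(probs, targets), 4))      # -0.315

# Passing log probabilities gives the intended loss.
log_probs = [[math.log(p) for p in row] for row in probs]
print(round(nll_mean(log_probs, targets), 4))  # 1.3968
```

The first result is just -(0.12 + 0.51) / 2, which shows that the criterion itself never takes a logarithm; it only selects and averages.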

Calin_Serban · April 20, 2020, 9:04am

Ahh ok, thanks for the answer! What I am trying to figure out actually is how the nn.NLLLoss works for multidimensional tensors, but I couldn’t find an example. Could you give me a simple example on how that loss is calculated for a 2D or 3D tensor?

KFrank · April 20, 2020, 2:42pm

Hi Calin!

Please see (if I understand what you are asking) the description of
the “K-dimensional case” in the documentation for NLLLoss.

Here is an illustrative (pytorch 0.3.0) script:

import torch
torch.__version__

torch.manual_seed (2020)

nBatch = 2
nClass = 4
width = 3
height = 5
input = torch.randn (nBatch, nClass, width, height)
target = torch.multinomial (torch.ones (nClass) / nClass, nBatch * width * height, replacement = True).resize_ (nBatch, width, height)

input.shape
target.shape
target.min()
target.max()

input = torch.autograd.Variable (input)
input = torch.nn.functional.log_softmax (input, dim = 1)
target = torch.autograd.Variable (target)

torch.nn.NLLLoss() (input, target)

And here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> torch.manual_seed (2020)
<torch._C.Generator object at 0x00000170D6456630>
>>>
>>> nBatch = 2
>>> nClass = 4
>>> width = 3
>>> height = 5
>>> input = torch.randn (nBatch, nClass, width, height)
>>> target = torch.multinomial (torch.ones (nClass) / nClass, nBatch * width * height, replacement = True).resize_ (nBatch, width, height)
>>>
>>> input.shape
torch.Size([2, 4, 3, 5])
>>> target.shape
torch.Size([2, 3, 5])
>>> target.min()
0
>>> target.max()
3
>>>
>>> input = torch.autograd.Variable (input)
>>> input = torch.nn.functional.log_softmax (input, dim = 1)
>>> target = torch.autograd.Variable (target)
>>>
>>> torch.nn.NLLLoss() (input, target)
Variable containing:
 1.9742
[torch.FloatTensor of size 1]

Note that target has one less dimension than input. In particular,
target does not have an nClass dimension, while input does.
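
For the K-dimensional case with the default mean reduction, the loss is the average of the per-position losses over the batch and all spatial positions. A minimal plain-Python sketch, assuming nested lists laid out as log_probs[n][c][h][w] and target[n][h][w] (the helper nll_mean_kdim and the tiny example values are made up for illustration):

```python
import math

def nll_mean_kdim(log_probs, target):
    # log_probs[n][c][h][w] holds log probabilities; target[n][h][w] holds class indices.
    losses = []
    for n, sample in enumerate(target):
        for h, row in enumerate(sample):
            for w, cls in enumerate(row):
                losses.append(-log_probs[n][cls][h][w])
    return sum(losses) / len(losses)

# Shape (N=1, C=2, H=1, W=2): two pixels, two classes.
probs = [[[[0.25, 0.5]],    # class 0 probabilities per pixel
          [[0.75, 0.5]]]]   # class 1 probabilities per pixel
log_probs = [[[[math.log(p) for p in row] for row in plane] for plane in s] for s in probs]
target = [[[1, 0]]]         # shape (N=1, H=1, W=2): no class dimension

loss = nll_mean_kdim(log_probs, target)
print(round(loss, 4))  # 0.4904
```

Here the result is (-ln(0.75) - ln(0.5)) / 2: each pixel is treated like one sample of the 2D case.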

Best.

K. Frank

Calin_Serban · April 20, 2020, 6:32pm

Hi Frank, so I took a simpler example to try to understand:

m = nn.LogSoftmax(dim=1)
loss = nn.NLLLoss()
# input is of size N x C = 2 X 2
input = torch.randn(2, 2, requires_grad=True)
# each element in target has to have 0 <= value < C
target = torch.tensor([1, 0])
output = loss(m(input), target)
print(m(input))
print(target)
print(output)

One of its outputs was:

tensor([[-1.1722, -0.3706],
        [-0.5150, -0.9100]], grad_fn=<LogSoftmaxBackward>)
tensor([1, 0])
tensor(0.4428, grad_fn=<NllLossBackward>)

So which is the formula involved in this example for getting the value 0.4428 based on the given input and target? It’s not really clear to me how l1,…,ln from the l(x, y) formula (official NLLLoss documentation) are calculated, since the loss weight is None.

Thanks, Calin!

KFrank · April 20, 2020, 11:37pm

Hello Calin!

Calin_Serban:

tensor([[-1.1722, -0.3706],
        [-0.5150, -0.9100]], grad_fn=<LogSoftmaxBackward>)
tensor([1, 0])
tensor(0.4428, grad_fn=<NllLossBackward>)
So which is the formula involved in this example for getting the value 0.4428

0.4428 = -(-0.3706 + -0.5150) / 2.

That is, the value of your output is the average of the losses for each
of the two samples in your batch.

Quoting from the NLLLoss documentation:

weight (Tensor, optional) – a manual rescaling weight given to each class. If given, it has to be a Tensor of size C. Otherwise, it is treated as if having all ones.

Optional means that the argument is allowed to be None, i.e., absent.
In such a case there is no reweighting (or, equivalently, the reweighting
factors are all equal to 1).
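
To make the role of weight concrete: with the default mean reduction, the documentation's formula divides the weighted sum of per-sample losses by the sum of the weights of the target classes. The plain-Python sketch below follows that formula (the helper weighted_nll_mean and the example weights [1.0, 3.0] are made up for illustration); with no weights it reduces to the simple average shown above.

```python
def weighted_nll_mean(log_probs, targets, weight=None):
    n_classes = len(log_probs[0])
    if weight is None:
        weight = [1.0] * n_classes  # absent weight behaves like all ones
    # Weighted sum of -log_prob at each target index ...
    numerator = sum(-weight[t] * row[t] for row, t in zip(log_probs, targets))
    # ... divided by the total weight of the selected target classes.
    denominator = sum(weight[t] for t in targets)
    return numerator / denominator

log_probs = [[-1.1722, -0.3706], [-0.5150, -0.9100]]
targets = [1, 0]

print(round(weighted_nll_mean(log_probs, targets), 4))              # 0.4428
print(round(weighted_nll_mean(log_probs, targets, [1.0, 3.0]), 4))  # 0.4067
```

In the weighted call, the first sample (target class 1) counts three times as much as the second, so the result is (3 * 0.3706 + 1 * 0.5150) / 4.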

Best.

K. Frank

Calin_Serban · April 21, 2020, 8:23am

Thank you very much, Frank!

It’s finally clear now.

shatakshi_raman · July 4, 2020, 10:30am

I don’t know if I can post a question here, but it’d be great to find a solution because, honestly, I don’t understand what’s wrong here.

optimizer = optim.Adam(net.parameters(), lr=0.001)

EPOCHS = 3

for epoch in range(EPOCHS):
    for data in trainset:
        X, y = data
        # Make the zero_grad()
        net.zero_grad()
        # output of the loss created
        input = net(X.view(-1, 784)
        # calculating the loss
        loss = F.nll_loss(input, y)  # calc and grab the loss value
        loss.backward()  # apply this loss backwards thru the network's parameters
        optimizer.step()  # attempt to optimize weights to account for loss/gradients
    print(loss)  # print loss. We hope loss (a measure of wrong-ness) declines!

But this keeps raising an error at loss = F.nll_loss(input, y), and the error is a SyntaxError. How do I solve it?
I am new to deep learning, but I have experience with sklearn

ptrblck · July 6, 2020, 5:41am

What kind of error are you seeing?
Could you post the complete error message with the stack trace here, please?

Often F.nll_loss raises a shape mismatch error, since for a multi-class classification use case the model output is expected to contain log probabilities (i.e. F.log_softmax applied as the last activation function on the output) and to have the shape [batch_size, nb_classes]. The target should be a LongTensor of shape [batch_size] and should contain the class indices in the range [0, nb_classes-1].
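
On the SyntaxError itself: the line input = net(X.view(-1, 784) in the snippet is missing a closing parenthesis, and Python versions before 3.10 typically report an unclosed bracket on the following line, which would match an error pointing at the F.nll_loss call. A small sketch using the built-in compile() to check this (the helper has_syntax_error is made up for illustration):

```python
def has_syntax_error(source):
    # compile() only parses the code, so undefined names such as net or F do not matter here.
    try:
        compile(source, "<snippet>", "exec")
    except SyntaxError:
        return True
    return False

# The two lines as posted, and the same lines with the parenthesis closed.
broken = "input = net(X.view(-1, 784)\nloss = F.nll_loss(input, y)\n"
fixed = "input = net(X.view(-1, 784))\nloss = F.nll_loss(input, y)\n"

print(has_syntax_error(broken))  # True
print(has_syntax_error(fixed))   # False
```

Once the parenthesis is closed, any remaining error would more likely be of the shape-mismatch kind described above.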

pchhapolika · April 4, 2022, 8:17am

When I use single-label classification, my labels are either 0 or 1.

My model has a Linear layer as its last layer.

I get this output:

SequenceClassifierOutput(loss=tensor(0.3405, grad_fn=<NllLossBackward>), logits=tensor([[ 0.5105, -0.3917]], grad_fn=<AddmmBackward>), hidden_states=None, attentions=None)

How is the loss calculated here from the logits?

ptrblck · April 4, 2022, 8:40am

For a single binary output with labels in [0, 1] you should use nn.BCEWithLogitsLoss instead of nn.NLLLoss.
The output of the model should then be a single value containing the logit and should be passed directly to the criterion without applying sigmoid on it.
You could also use sigmoid and nn.BCELoss but the numerical stability would be worse.
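
The stability point can be illustrated in plain Python. The sketch below compares a naive sigmoid-then-log computation with one common numerically stable formulation that works on the logit directly; both helpers are written for illustration only and are not claimed to be PyTorch's actual implementation:

```python
import math

def bce_naive(logit, label):
    # Apply sigmoid first, then the binary cross-entropy formula.
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(label * math.log(p) + (1.0 - label) * math.log(1.0 - p))

def bce_with_logits(logit, label):
    # One common stable form: max(x, 0) - x*y + log(1 + exp(-|x|)).
    return max(logit, 0.0) - logit * label + math.log1p(math.exp(-abs(logit)))

# For a moderate logit both versions agree.
print(round(bce_naive(0.5, 1.0), 4), round(bce_with_logits(0.5, 1.0), 4))  # 0.4741 0.4741

# For a large logit with label 0, sigmoid rounds to exactly 1.0 and log(0) fails.
try:
    bce_naive(40.0, 0.0)
    naive_failed = False
except ValueError:
    naive_failed = True
print(naive_failed)                # True
print(bce_with_logits(40.0, 0.0))  # 40.0
```

The naive version loses the information as soon as the sigmoid saturates, while the logit-based form never computes the probability explicitly.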

pchhapolika · April 4, 2022, 12:17pm

But how can I do that explicitly here? Is the loss picked automatically by the model?

Also, how does it arrive at the value 0.3405?

model = tr.XLMRobertaForSequenceClassification.from_pretrained("/home/stb/AIML/model_mlm_vocab_exp1_20epocs",problem_type="single_label_classification", num_labels=2,
                                                               ignore_mismatched_sizes=True, id2label={0: 'negative', 1: 'positive'})

training_args = tr.TrainingArguments(
    #report_to = 'wandb',
    output_dir='/home/stb/AIML/results_vocab_ext_exp1',          # output directory
    overwrite_output_dir = True,
    num_train_epochs=10,              # total number of training epochs
    per_device_train_batch_size=10,  # batch size per device during training
    per_device_eval_batch_size=10,   # batch size for evaluation
    learning_rate=2e-5,
    warmup_steps=200,                # number of warmup steps for learning rate scheduler
    weight_decay=0.01,               # strength of weight decay
    logging_dir='./logs_exp1',            # directory for storing logs
    logging_steps=6000,
    evaluation_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    fp16=True,
    run_name="run1",
    gradient_accumulation_steps=20,
)

ptrblck · April 4, 2022, 6:18pm

I’m not sure which higher-level library you are using, but I would guess that the loss is calculated internally?
If so, you would need to check the internal implementation of this library.
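
As for the value 0.3405 asked about earlier: it is consistent with log-softmax followed by negative log likelihood on the two posted logits, assuming the true label was 0. The label is not shown in the post, and the use of log-softmax plus NLL is only suggested by the NllLossBackward grad_fn, so treat both as assumptions. A plain-Python check:

```python
import math

def log_softmax(logits):
    # Hypothetical helper mirroring log-softmax over one row of logits.
    log_sum_exp = math.log(sum(math.exp(x) for x in logits))
    return [x - log_sum_exp for x in logits]

logits = [0.5105, -0.3917]  # from the SequenceClassifierOutput above
assumed_label = 0           # assumption: the true label is not shown in the post

loss = -log_softmax(logits)[assumed_label]
print(round(loss, 4))       # 0.3405
```

With label 1 instead, the same computation would give a noticeably larger value, so label 0 is the reading that matches the posted loss.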