PyTorch formula for NLL Loss

Hi,

I was wondering why the negative log likelihood loss (NLLLoss()) in torch.nn expects a target. torch.nn.NLLLoss() uses nll_loss(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction) in its forward call. If NLL has the format -∑_c y_{o,c} · log(p_{o,c}), why is the target vector needed to compute this, and not just the output of our nn.Softmax() layer?

Thanks,

JP

In your formula I assume y gives the target class. Have a look at the nn.CrossEntropyLoss docs to see the applied formula.

Thank you for your answer,

So the formula below describes the CrossEntropyLoss function implemented in PyTorch:

loss(x, class) = -log(exp(x[class]) / ∑_j exp(x[j])) = -x[class] + log(∑_j exp(x[j]))

From what I understand, this function never uses the label (or target) of the sample to compute the probability output for a class, since x[class] refers to the probability of our sample belonging to a class, and ∑_j exp(x[j]) is just the sum over the exponentials of the outputs for all classes. So where does PyTorch actually use the label of the sample in this loss function? I feel I’m missing something.

By indexing x with class you are indeed using your target. Without the target you won’t know which logit to use.
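
Here is a minimal sketch (with made-up logits) of how the target index is used to pick out the log-probability of the true class:

```python
import torch
import torch.nn.functional as F

# made-up logits for a batch of 2 samples and 3 classes
x = torch.tensor([[1.0, 2.0, 0.5],
                  [0.1, 0.3, 2.0]])
target = torch.tensor([1, 2])  # class index for each sample

log_probs = F.log_softmax(x, dim=1)

# the target is only used to select the log-probability of the true class
picked = log_probs[torch.arange(x.size(0)), target]
loss_manual = -picked.mean()

print(loss_manual)
print(F.cross_entropy(x, target))      # same value
print(F.nll_loss(log_probs, target))   # same value
```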

Got it, thanks!

Related to this, I’m actually trying to tweak this loss function. I’d like to replace the target (y_o,c) in the loss function below with the prediction probability output for a specific class, p_o,c, basically turning this

-∑_c y_{o,c} · log(p_{o,c})

into this

-∑_c p_{o,c} · log(p_{o,c})

I tried simply replacing the target argument with the outputs again when calling a CrossEntropyLoss() object, but PyTorch expects a torch.long dtype, not torch.float as is the case with the outputs. Do you have any idea how I could obtain this new loss function?

Thanks,

JP

If you are looking for label smoothing, this thread might have an interesting code snippet.
Alternatively, you could just write out the formula.
I’m not sure how p_o,c is defined, but I guess it should be the probability of class c for sample o?

-1. * (target * F.log_softmax(x, 1)).sum()
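
To make that line concrete, here is a small sketch with made-up shapes, assuming target holds per-class probabilities of shape (batch_size, num_classes) rather than class indices:

```python
import torch
import torch.nn.functional as F

batch_size, num_classes = 4, 3
x = torch.randn(batch_size, num_classes, requires_grad=True)  # logits

# "soft" target: any per-sample probability distribution works,
# e.g. the detached softmax output itself, as described above
target = F.softmax(x, dim=1).detach()

loss = -1. * (target * F.log_softmax(x, dim=1)).sum()
loss.backward()  # the loss is differentiable w.r.t. x
print(loss, x.grad.shape)
```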

If we’re talking mini-batch training, p_o,c would be an array of size (batch_size x num_classes), representing, for each sample in the mini-batch, its probability distribution over all possible classes. My concern is that I need a loss function that can be backpropagated through. I’ll try that line and come back to you.

I think this is incorrect: nll_loss already assumes that you pass log(probabilities). You can test this by giving it plain probabilities (one-hot format with values in the range [0, 1]) and it will give you negative values.
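
For example (a quick sketch to illustrate the point):

```python
import torch
import torch.nn.functional as F

x = torch.randn(4, 3)               # raw logits
target = torch.tensor([0, 2, 1, 0])

# nll_loss expects log-probabilities, i.e. the output of log_softmax
print(F.nll_loss(F.log_softmax(x, dim=1), target))  # >= 0, equals F.cross_entropy(x, target)

# feeding plain probabilities instead can give negative values,
# since the log is expected to have been applied already
print(F.nll_loss(F.softmax(x, dim=1), target))      # can be negative
```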

I ended up modifying the function a bit for my needs, but it worked! Thanks