Is log_softmax + NLLLoss == CrossEntropyLoss?

If I’m not missing something, they should be the same. However, I tried the following snippet, and the two losses are not equal.

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.autograd import Variable


class Net(nn.Module):

    def __init__(self, n_features, n_hiddens, n_classes):
        super(Net, self).__init__()
        self.gru = torch.nn.GRU(n_features, n_hiddens)
        self.linear = torch.nn.Linear(n_hiddens, n_classes)

    def forward(self, x, flag=True):
        o, h = self.gru(x)
        o = self.linear(o)

        if flag:
            o = F.log_softmax(o)  # no dim given: the default dimension depends on the input's shape

        return o


n_steps = 10
n_classes = 100
mb_size = 32
n_features = 50
n_hiddens = 60

net = Net(n_features, n_hiddens, n_classes)

loss1 = torch.nn.NLLLoss(size_average=False)
loss2 = torch.nn.CrossEntropyLoss(size_average=False)

x = Variable(torch.rand(n_steps, mb_size, n_features))
y = Variable(
    torch.LongTensor(np.random.randint(0, n_classes, (n_steps, mb_size))))

logits1 = net(x, flag=True).view(-1, n_classes)
logits2 = net(x, flag=False).view(-1, n_classes)

loss_val1 = loss1(logits1, y.view(-1))
loss_val2 = loss2(logits2, y.view(-1))

print(loss_val1)
print(loss_val2)

They are the same (see the implementation). I think the reason it isn’t working out for you is that log_softmax gives different results depending on the shape of its input when dim isn’t specified: the tensor passed into log_softmax inside forward is 3-D, while logits2 is the 2-D result of the view.
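
Here is a minimal sketch (my own toy example, not from the post above) of that effect. If I remember the legacy defaults correctly, log_softmax without dim normalizes over dim 0 for a 3-D input but over dim 1 for a 2-D input, so I pass those dims explicitly to reproduce the mismatch:

import torch
import torch.nn.functional as F

o = torch.rand(10, 32, 100)                    # (n_steps, mb_size, n_classes)

a = F.log_softmax(o, dim=0)                    # what the implicit default does for a 3-D input
b = F.log_softmax(o.view(-1, 100), dim=1)      # what the implicit default does for a 2-D input

print(torch.allclose(a.view(-1, 100), b))      # False: normalized along different dimensions

# With an explicit dim=-1 both paths agree:
print(torch.allclose(F.log_softmax(o, dim=-1).view(-1, 100),
                     F.log_softmax(o.view(-1, 100), dim=-1)))   # True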


Thank you so much. I can’t believe the IPython help I get for log_softmax doesn’t even mention it; I didn’t know there’s a dim param in the log_softmax function.


The dim parameter is new and will be in the next release. The docs are fixed too. Here’s what it says in master, if you build from source:

In [5]: ?torch.nn.functional.log_softmax
Signature: torch.nn.functional.log_softmax(input, dim=None, _stacklevel=3)
Docstring:
Applies a softmax followed by a logarithm.

While mathematically equivalent to log(softmax(x)), doing these two
operations separately is slower, and numerically unstable. This function
uses an alternative formulation to compute the output and gradient correctly.

See :class:`~torch.nn.LogSoftmax` for more details.

Arguments:
    input (Variable): input
    dim (int): A dimension along which log_softmax will be computed.
File:      /data/users/sgross/pytorch/torch/nn/functional.py
Type:      function

You can always use docs.pytorch.org
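
For example, here is a tiny sketch of what the dim argument controls (my own example, not from the docs): after exp, the output sums to 1 along whichever dimension you pass.

import torch
import torch.nn.functional as F

x = torch.randn(4, 3, 6)

p_last = F.log_softmax(x, dim=-1).exp()   # normalize over the last dimension
p_mid  = F.log_softmax(x, dim=1).exp()    # normalize over dim 1

print(p_last.sum(dim=-1))   # all ones
print(p_mid.sum(dim=1))     # all ones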


Thank you for the information.

Yes, according to the CrossEntropyLoss definition on the official website, they are equivalent: the docs say this criterion combines LogSoftmax and NLLLoss in one single class.
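
A quick sketch to check that equivalence with the functional API (my own example, arbitrary shapes):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 5)                  # (batch, n_classes)
target = torch.randint(0, 5, (8,))

ce = F.cross_entropy(logits, target)
nll = F.nll_loss(F.log_softmax(logits, dim=-1), target)

print(torch.allclose(ce, nll))              # True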


Could you elaborate on “log_softmax gives different results depending on shape”? I’ve printed the shapes and they look the same.

You need to specify o = F.log_softmax(o, dim=-1). With that change, the code below works for me and both losses come out equal:


import numpy as np
import torch
import torch.nn.functional as F
from torch import nn
from torch.autograd import Variable


class Net(nn.Module):

    def __init__(self, n_features, n_hiddens, n_classes):
        super(Net, self).__init__()
        self.gru = torch.nn.GRU(n_features, n_hiddens)
        self.linear = torch.nn.Linear(n_hiddens, n_classes)

    def forward(self, x, flag=True):
        o, h = self.gru(x)
        o = self.linear(o)

        if flag:
            o = F.log_softmax(o, dim=-1)  # explicit dim: normalize over the class dimension

        return o


n_steps = 10
n_classes = 100
mb_size = 32
n_features = 50
n_hiddens = 60

net = Net(n_features, n_hiddens, n_classes)

loss1 = torch.nn.NLLLoss(size_average=False)
loss2 = torch.nn.CrossEntropyLoss(size_average=False)

x = Variable(torch.rand(n_steps, mb_size, n_features))
y = Variable(
    torch.LongTensor(np.random.randint(0, n_classes, (n_steps, mb_size))))

logits1 = net(x, flag=True).view(-1, n_classes)
logits2 = net(x, flag=False).view(-1, n_classes)

loss_val1 = loss1(logits1, y.view(-1))
loss_val2 = loss2(logits2, y.view(-1))

print(loss_val1)
print(loss_val2)