Why does LogSigmoid + NLLLoss fail while LogSoftmax + NLLLoss works?

So I have a simple test network for MNIST data, as follows:

def forward(self, x):
    x = F.relu(F.max_pool2d(self.conv1(x), 2))
    x = F.relu(F.max_pool2d(self.conv2_drop(self.conv2(x)), 2))
    x = x.view(-1, 80)
    x = F.relu(self.fc1(x))
    x = F.dropout(x, training=self.training)
    x = self.fc2(x)
    return F.XXXX(x)  # XXXX is the function under test: log_softmax(x, dim=1) or logsigmoid(x)
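(For completeness, the layers referenced above are defined roughly as in the sketch below; the exact channel counts and hidden size are illustrative guesses, picked so that x.view(-1, 80) matches 28x28 MNIST inputs.)

import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # 28x28 -> conv(5) -> 24x24 -> pool(2) -> 12x12
        self.conv1 = nn.Conv2d(1, 10, kernel_size=5)
        # 12x12 -> conv(5) -> 8x8 -> pool(2) -> 4x4, with 5 channels: 5 * 4 * 4 = 80 features
        self.conv2 = nn.Conv2d(10, 5, kernel_size=5)
        self.conv2_drop = nn.Dropout2d()
        self.fc1 = nn.Linear(80, 50)   # hidden size 50 is an arbitrary choice
        self.fc2 = nn.Linear(50, 10)   # 10 MNIST classes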

I am using NLLLoss to implement the cross-entropy loss explicitly.
I wanted to understand how different activation functions affect the accuracy,
so I tried LogSoftmax and verified that the network trains, but for some reason, when I use LogSigmoid, the network fails to train.
(Note that NLLLoss expects log probabilities.)
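Concretely, the training step looks something like this (model, data, and target are placeholder names; the only thing that changes between the two experiments is the function applied to the final layer):

import torch.nn as nn
import torch.nn.functional as F

criterion = nn.NLLLoss()  # expects log probabilities of shape (N, C)

# inside the training loop
output = model(data)              # forward() ends in F.log_softmax(x, dim=1) or F.logsigmoid(x)
loss = criterion(output, target)  # target holds class indices, shape (N,)
loss.backward()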

Since softmax and sigmoid both output values between 0 and 1, I thought there shouldn't be an issue.
Can anyone explain the detail I am not catching here?

Since both output values between 0 and 1, I thought there shouldn't be an issue.

The output of LogSigmoid isn't between 0 and 1; please refer to the docs.

Sorry, that was a mistake. I knew they are not.
I meant to say that sigmoid and softmax both output values between 0 and 1.
The post is updated.

To me, the LogSigmoid + NLLLoss combination hardly makes any sense: the objective only tries to promote the ground-truth class and puts no suppression on the negative ones. Maybe you want to try sigmoid + BCELoss instead.
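For reference, a minimal sketch of the sigmoid + BCELoss variant (model, data, and target are the same placeholder names as above; BCELoss wants probabilities and a float one-hot target, and nn.BCEWithLogitsLoss would fold the sigmoid in for better numerical stability):

import torch
import torch.nn as nn
import torch.nn.functional as F

criterion = nn.BCELoss()

# inside the training loop; assumes forward() now ends in torch.sigmoid(x)
probs = model(data)                                        # shape (N, 10), values in (0, 1)
target_onehot = F.one_hot(target, num_classes=10).float()  # BCELoss needs a float target
loss = criterion(probs, target_onehot)
loss.backward()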

Thank you so much for your advice.
The model does train with BCELoss when using sigmoid.

Softmax also works with BCE, but only up to a certain point, and then the training collapses.
I am not sure why, but I guess it has something to do with the dependency among classes…

Do you know of a loss function that would work well with both activation functions?

A naive answer: if you really want to test a single loss function with both activation functions, what about L2 loss with one-hot vectors as the target? Something like the sketch below.

I’m not sure if it will give good performance though.
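Roughly like this, assuming the same placeholder names as above; nn.MSELoss against a one-hot float target works with either a sigmoid or a softmax output:

import torch.nn as nn
import torch.nn.functional as F

criterion = nn.MSELoss()

out = model(data)                                          # sigmoid or softmax probabilities, shape (N, 10)
target_onehot = F.one_hot(target, num_classes=10).float()
loss = criterion(out, target_onehot)
loss.backward()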

Softmax is actually not an activation function…
LogSigmoid + NLLLoss doesn't make sense mathematically (if you derive the gradients, you'll see why).
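You can see it directly from the gradients: NLLLoss just picks out -output[y], so with log-sigmoid only the target logit ever receives a gradient, while with log-softmax the normalization term pushes every other logit down. A quick check with autograd (random logits, just to illustrate):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
target = torch.tensor([3])

# log-sigmoid + NLLLoss: only the target logit gets a gradient
logits = torch.randn(1, 10, requires_grad=True)
F.nll_loss(F.logsigmoid(logits), target).backward()
print(logits.grad)   # nonzero only in column 3, and always <= 0: no suppression of the other classes

# log-softmax + NLLLoss: every logit gets a gradient, softmax(z) - one_hot(target)
logits2 = logits.detach().clone().requires_grad_(True)
F.nll_loss(F.log_softmax(logits2, dim=1), target).backward()
print(logits2.grad)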

The PyTorch documentation says NLLLoss expects log probabilities, though,
and that kind of log-activation + NLLLoss pairing is how CrossEntropyLoss is constructed in PyTorch.

nn.CrossEntropyLoss uses F.log_softmax and nn.NLLLoss internally as shown here.
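It is easy to verify that equivalence numerically (random logits, just as a sanity check):

import torch
import torch.nn.functional as F

logits = torch.randn(8, 10)
target = torch.randint(0, 10, (8,))

ce = F.cross_entropy(logits, target)
nll = F.nll_loss(F.log_softmax(logits, dim=1), target)
print(torch.allclose(ce, nll))  # True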