BCELoss vs BCEWithLogitsLoss

What is the advantage of using binary_cross_entropy_with_logits (aka BCE with sigmoid) over the regular binary_cross_entropy? I have a multi-binary classification problem and I’m trying to decide which one to choose.


As you described, the only difference is the sigmoid activation included in nn.BCEWithLogitsLoss.
It's comparable to the relationship between nn.CrossEntropyLoss and nn.NLLLoss: while the former applies nn.LogSoftmax internally, you have to add it yourself when using the latter criterion.


I understand the differences in the implementation, I don’t understand the theoretical advantages of using BCE with sigmoid vs without sigmoid.

Sorry for not being clear enough.
The sigmoid activation should be applied in both cases.
While nn.BCEWithLogitsLoss will apply it internally for you, you should add it manually if you are using nn.BCELoss.


@ptrblck Isn’t it the other way around? I thought BCELoss needs to receive the outputs of a Sigmoid activation as its input, while the other one, BCEWithLogitsLoss, needs the logits as inputs instead of the outputs of Sigmoid, since it applies the sigmoid internally.

Although, the example in the docs does not seem to apply the Sigmoid function prior to BCELoss:

### Example from pytorch-docs:
>>> m = nn.Sigmoid()
>>> loss = nn.BCELoss()
>>> input = torch.randn(3, requires_grad=True)
>>> target = torch.empty(3).random_(2)
>>> output = loss(m(input), target)
>>> output.backward()

So, I suppose the loss should be computed as

logits = m(input)
output = loss(torch.sigmoid(logits), target)

Is that right?


Yes, you are completely right and I’ve mixed up both names. I’ll edit my post to get it right.

In the example the nn.Sigmoid will be applied by m(input), so it should be right.
Your code, however, would apply the sigmoid function twice (once in m() and a second time using torch.sigmoid), so you should remove one of these calls.


Yes, that’s right. I somehow overlooked the definition of m.



@Shani_Gamrian Use BCEWithLogitsLoss - it’s more numerically stable than using a plain Sigmoid followed by a BCELoss (it uses the log-sum-exp trick internally for numerical stability).

see https://github.com/pytorch/pytorch/issues/751
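A small sketch of what the stability issue looks like in practice: for an extreme logit, the standalone sigmoid underflows to exactly 0 in float32, so the separate BCELoss can no longer recover the true loss, while the fused version computes it exactly:

```python
import torch
import torch.nn.functional as F

# A large negative logit with target 1: sigmoid(-200) underflows to 0.0
# in float32, so log(sigmoid(x)) would be -inf in the separate-sigmoid path.
logit = torch.tensor([-200.0])
target = torch.tensor([1.0])

p = torch.sigmoid(logit)  # tensor([0.]) -- the probability has underflowed
unstable = F.binary_cross_entropy(p, target)  # degenerate: the true magnitude is lost
stable = F.binary_cross_entropy_with_logits(logit, target)  # ~200.0, the exact loss

print(p.item(), unstable.item(), stable.item())
```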


Just to clarify, if using nn.BCEWithLogitsLoss(target, output), output should be passed through a sigmoid and only then to BCEWithLogitsLoss? I don’t understand why one would pass it through a sigmoid twice, because the output is already a probability after passing through one sigmoid.

No, that was a typo which @vmirly1 already corrected.
You should pass logits to nn.BCEWithLogitsLoss and probabilities (using sigmoid) to nn.BCELoss.
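For moderate values both paths then compute the same loss, which is easy to sanity-check:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4)              # raw model outputs (logits)
target = torch.empty(4).random_(2)   # binary targets (0. or 1.)

# BCEWithLogitsLoss consumes the logits directly ...
loss_with_logits = nn.BCEWithLogitsLoss()(logits, target)
# ... while BCELoss needs probabilities, i.e. sigmoid(logits)
loss_with_probs = nn.BCELoss()(torch.sigmoid(logits), target)

print(loss_with_logits.item(), loss_with_probs.item())  # same value
```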

Also, make sure to pass the model output first and then the target to the criterion.


Sorry for asking my question here. I’m doing word2vec with negative sampling, and I had problems using nn.NLLLoss to train my network. While reading about the PyTorch loss functions I found binary_cross_entropy_with_logits; the docs say that this loss combines a Sigmoid layer and the BCELoss in one single class, that it is used for measuring the error of a reconstruction in, for example, an auto-encoder, and that the targets y should be numbers between 0 and 1. So do you think I made the right choice for my loss function?

Negative sampling might work with nn.BCE(WithLogits)Loss, but might be inefficient, as you would probably calculate the non-reduced loss for all classes and mask them afterwards.
Some implementations sample the negative classes beforehand and calculate the bce loss manually, e.g. as described here.
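For illustration only, a rough sketch of such a manual negative-sampling loss (all tensor names, shapes, and the number of negatives here are made up):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim = 8
v_center = torch.randn(emb_dim, requires_grad=True)  # center word embedding
u_pos = torch.randn(emb_dim)                          # true context word embedding
u_neg = torch.randn(5, emb_dim)                       # 5 pre-sampled negative words

# word2vec negative-sampling objective:
#   minimize -log sigmoid(u_pos . v) - sum_k log sigmoid(-u_neg_k . v)
# F.logsigmoid keeps this numerically stable, much like BCEWithLogitsLoss does.
pos_term = F.logsigmoid(torch.dot(u_pos, v_center))
neg_term = F.logsigmoid(-(u_neg @ v_center)).sum()
loss = -(pos_term + neg_term)
loss.backward()
print(loss.item())
```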


But what exactly is the difference between the two? I mean, how is BCEWithLogitsLoss more stable, and what does stability mean here? It would be great if you could explain it to me in a little more detail. Thanks in advance :smiley:


Just to be super duper clear: do not pass the inputs through a sigmoid if using BCEWithLogitsLoss?

And for the targets, just use 1 for the true classes and 0 for the false classes?


  • logits -> nn.BCEWithLogitsLoss
  • logits -> sigmoid -> nn.BCELoss

That would be the standard definition, but you can basically define True/False, Positive/Negative as you wish.


Thanks everyone for your input.

If the network outputs logits and is trained with BCEWithLogitsLoss, during inference, should I use the logits directly as a probability or should I apply sigmoid first?

Now that I think about it, the logit could be negative or larger than 1, making it unsuitable as a probability, so I should apply a sigmoid during inference. Can someone please confirm?

y*ln(sig(x)) + (1 - y)*ln(1 - sig(x))
= y*ln(1 / (1 + e^-x)) + (1 - y)*ln(1 - 1 / (1 + e^-x))
= ln(e^(yx) / (e^x + 1))

If I’m correct, it works like this.
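The identity can also be checked numerically (a plain-Python sanity check over a few sample points):

```python
import math

def lhs(x, y):
    # y*ln(sig(x)) + (1 - y)*ln(1 - sig(x))
    sig = 1.0 / (1.0 + math.exp(-x))
    return y * math.log(sig) + (1 - y) * math.log(1 - sig)

def rhs(x, y):
    # ln(e^(yx) / (e^x + 1))
    return math.log(math.exp(y * x) / (math.exp(x) + 1))

diffs = [abs(lhs(x, y) - rhs(x, y)) for x in (-2.0, 0.5, 3.0) for y in (0.0, 1.0)]
print(max(diffs))  # tiny -- both sides agree
```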

The values of the logits might be harder to interpret, so you might want to apply a sigmoid to get the probabilities.
Note that a logit of 0 will map to p=0.5, so you can still easily get the prediction with this simple threshold directly on the logits.
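For example, thresholding the logits at 0 gives exactly the same predictions as thresholding the probabilities at 0.5:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(10)  # stand-in for raw model outputs

# sigmoid is monotonic and sigmoid(0) == 0.5, so both checks agree elementwise
preds_from_logits = logits > 0.0
preds_from_probs = torch.sigmoid(logits) > 0.5
print(torch.equal(preds_from_logits, preds_from_probs))
```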


sigmoid(ln(x / (1 - x))) = x

For both BCEWithLogitsLoss and CrossEntropyLoss (the 1-step versions), will we need to do this when doing inference?

logps = model(img)
ps = torch.exp(logps)

Also, even if it’s 2 steps (i.e. LogSoftmax + NLLLoss), the above still applies, right?
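As far as I understand, torch.exp is only the right inverse when the model itself ends in nn.LogSoftmax (the 2-step setup above); with raw logits you would use softmax or sigmoid instead. A sketch with a stand-in output tensor:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(2, 5)  # stand-in for raw model outputs

# trained with nn.CrossEntropyLoss -> model outputs logits -> softmax at inference
probs_ce = F.softmax(logits, dim=1)

# trained with nn.LogSoftmax + nn.NLLLoss -> model outputs log-probs -> exp at inference
logps = F.log_softmax(logits, dim=1)
probs_nll = torch.exp(logps)

# trained with nn.BCEWithLogitsLoss -> model outputs logits -> sigmoid at inference
probs_bce = torch.sigmoid(logits)

print(torch.allclose(probs_ce, probs_nll))  # same probabilities either way
```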