What is the advantage of using `binary_cross_entropy_with_logits` (aka BCE with sigmoid) over the regular `binary_cross_entropy`? I have a multi-binary classification problem and I’m trying to decide which one to choose.

As you described, the only difference is the included sigmoid activation in `nn.BCEWithLogitsLoss`.

It’s comparable to `nn.CrossEntropyLoss` and `nn.NLLLoss`. While the former uses a `nn.LogSoftmax` activation function internally, you would have to add it manually for the latter criterion.
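That relationship can be sketched quickly (shapes and values here are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw model outputs for 4 samples, 3 classes
targets = torch.tensor([0, 2, 1, 2])  # class indices

# nn.CrossEntropyLoss applies LogSoftmax + NLLLoss internally on raw logits
ce = nn.CrossEntropyLoss()(logits, targets)

# nn.NLLLoss expects log-probabilities, so LogSoftmax must be added manually
log_probs = nn.LogSoftmax(dim=1)(logits)
nll = nn.NLLLoss()(log_probs, targets)

print(torch.allclose(ce, nll))  # True
```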

I understand the differences in the implementation, but I don’t understand the theoretical advantage of using BCE with sigmoid vs. without sigmoid.

Sorry for not being clear enough.

The sigmoid activation should be applied in both cases. While `nn.BCEWithLogitsLoss` will apply it internally for you, you should add it manually if you are using `nn.BCELoss`.
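For moderate logits the two are numerically interchangeable; a quick sketch (shapes and values made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(5)              # raw model outputs (logits)
targets = torch.empty(5).random_(2)  # binary targets in {0., 1.}

# nn.BCEWithLogitsLoss applies the sigmoid internally, so it takes raw logits
loss_with_logits = nn.BCEWithLogitsLoss()(logits, targets)

# nn.BCELoss expects probabilities, so the sigmoid is applied manually
loss_plain = nn.BCELoss()(torch.sigmoid(logits), targets)

print(torch.allclose(loss_with_logits, loss_plain))  # True
```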

@ptrblck Isn’t it the other way around? I thought `BCELoss` needs to receive the outputs of a `Sigmoid` activation as its input, while `BCEWithLogitsLoss` needs the logits as inputs instead of the outputs of `Sigmoid`, since it will apply the sigmoid internally.

Although, the example in the docs does not apply the Sigmoid function prior to `BCELoss`:

```
### Example from pytorch-docs:
>>> m = nn.Sigmoid()
>>> loss = nn.BCELoss()
>>> input = torch.randn(3, requires_grad=True)
>>> target = torch.empty(3).random_(2)
>>> output = loss(m(input), target)
>>> output.backward()
```

So, I suppose the loss should be computed as

```
logits = m(input)
output = loss(torch.sigmoid(logits), target)
```

Is that right?

Yes, you are completely right and I’ve mixed up both names. I’ll edit my post to get it right.

In the example, the `nn.Sigmoid` will be applied by `m(input)`, so it should be right.

Your code would apply the sigmoid function twice (once in `m()` and a second time using `torch.sigmoid`).

Yes, that’s right. I somehow overlooked the definition of `m`.

Thanks

@Shani_Gamrian Use `BCEWithLogitsLoss` - it’s more stable than using a plain `Sigmoid` followed by a `BCELoss` (it uses the log-sum-exp trick for numerical stability).
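“Stable” here refers to how the log terms are evaluated. A small sketch of the failure mode, writing the naive formula out by hand (100 is just an arbitrarily large logit):

```python
import torch
import torch.nn.functional as F

logit = torch.tensor([100.0])
target = torch.tensor([0.0])

# Naive formula: the sigmoid saturates to exactly 1.0 in float32,
# so log(1 - p) becomes log(0) = -inf and the loss overflows
p = torch.sigmoid(logit)
naive = -(target * torch.log(p) + (1 - target) * torch.log(1 - p))
print(naive)  # tensor([inf])

# The fused version rewrites the expression (log-sum-exp) and stays finite
fused = F.binary_cross_entropy_with_logits(logit, target)
print(fused)  # tensor(100.)
```

(`nn.BCELoss` itself clamps its log terms to avoid the infinity, but the gradient information is still lost once the sigmoid saturates.)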

Just to clarify, if using `nn.BCEWithLogitsLoss(target, output)`, should `output` be passed through a sigmoid and only then to `BCEWithLogitsLoss`? I don’t understand why one would pass it through a sigmoid twice, because x is already a probability after passing through one sigmoid.

No, that was a typo, which @vmirly1 already corrected. You should pass logits to `nn.BCEWithLogitsLoss` and probabilities (using sigmoid) to `nn.BCELoss`.

Also, make sure to pass the model output first and then the target to the criterion.
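A minimal sketch of that calling convention (shapes and values here are made up):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
output = torch.randn(4, 1)             # raw logits from the model
target = torch.empty(4, 1).random_(2)  # binary targets in {0., 1.}

criterion = nn.BCEWithLogitsLoss()
loss = criterion(output, target)  # model output first, target second
# swapping the arguments would not necessarily raise an error,
# but it would silently compute a meaningless quantity
print(loss.item())
```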

Sorry for asking my question here. I’m doing word2vec with negative sampling and I had problems using `nn.NLLLoss` to train my network. While reading about the PyTorch loss functions I found `binary_cross_entropy_with_logits`; the docs say that this loss combines a Sigmoid layer and the BCELoss in one single class, and that it is used for measuring the error of a reconstruction in, for example, an auto-encoder (note that the targets y should be numbers between 0 and 1). So do you think I made the right choice for my loss function?

Thanks

Negative sampling might work with `nn.BCE(WithLogits)Loss`, but it might be inefficient, as you would probably calculate the non-reduced loss for all classes and mask them afterwards.

Some implementations sample the negative classes beforehand and calculate the BCE loss manually, e.g. as described here.
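A rough sketch of such a manual variant (every name, shape, and sampling choice here is made up for illustration; it is not the linked implementation):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim, n_neg = 8, 5

# hypothetical embeddings: one (center, context) pair plus pre-sampled negatives
center = torch.randn(emb_dim)
context = torch.randn(emb_dim)
negatives = torch.randn(n_neg, emb_dim)

# push the positive pair's score up and the negatives' scores down,
# i.e. BCE with target 1 for the pair and target 0 for each sampled negative
pos_loss = -F.logsigmoid(center @ context)
neg_loss = -F.logsigmoid(-(negatives @ center)).sum()
loss = pos_loss + neg_loss
print(loss.item())
```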

But what exactly is the difference between the two? I mean, how is BCEWithLogitsLoss more stable? What does stability mean here? It would be great if you could explain it to me in a little more detail. Thanks in advance.

Just to be super duper clear: do **not** pass the inputs through a sigmoid if using `BCEWithLogitsLoss`? And for the targets, just use 1 for the true classes and 0 for the false classes?

Yes.

- logits -> `nn.BCEWithLogitsLoss`
- logits -> sigmoid -> `nn.BCELoss`

That would be the standard definition, but you can basically define True/False, Positive/Negative as you wish.

Thanks everyone for your input.

If the network outputs logits and is trained with BCEWithLogitsLoss, during inference, should I use the logits directly as a probability or should I apply sigmoid first?

edit:

Now that I think about it, a logit can be negative or larger than 1, making it unsuitable as a probability. So I should apply a sigmoid during inference. Please confirm.

`y*ln(sig(x)) + (1 - y)*ln(1 - sig(x))`
`= y*ln(1 / (1 + e^-x)) + (1 - y)*ln(1 - 1 / (1 + e^-x))`
`= ln(e^(y*x) / (e^x + 1))`

If I’m correct, it works like this.
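If that algebra is right, the last line (negated, since the loss is the negative log-likelihood) should match PyTorch numerically; a quick check with an arbitrary logit:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.7])  # an arbitrary logit
y = torch.tensor([1.0])  # binary target

# -ln(e^(y*x) / (e^x + 1)) = ln(1 + e^x) - y*x
closed_form = torch.log(1 + torch.exp(x)) - y * x

lib = F.binary_cross_entropy_with_logits(x, y)
print(torch.allclose(closed_form, lib))  # True
```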

The values of the logits might be harder to interpret, so you might want to apply a sigmoid to get the probabilities. Note that a logit of 0 will map to `p=0.5`, so you could still easily get the prediction for this simple threshold with logits.
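A small sketch of that equivalence (values are made up):

```python
import torch

torch.manual_seed(0)
logits = torch.randn(6)

# probabilities, if you need them (e.g. for reporting or calibration)
probs = torch.sigmoid(logits)

# for a plain 0.5 threshold, thresholding the logits at 0 is equivalent
preds_from_probs = probs > 0.5
preds_from_logits = logits > 0.0

print(torch.equal(preds_from_probs, preds_from_logits))  # True
```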

`sigmoid( ln(x / (1 - x)) ) = x`

For both BCEWithLogitsLoss and CrossEntropyLoss (the 1-step versions), will we need to do this when doing inference?

```
logps = model(img)
ps = torch.exp(logps)
```

Also, even if it’s 2 steps (i.e. LogSoftmax + NLLLoss), does the above still apply?

Thanks
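For reference, assuming the quoted snippet comes from a `LogSoftmax + NLLLoss` setup: `torch.exp` only recovers probabilities from log-probabilities, i.e. a `LogSoftmax` output. A model trained with `nn.CrossEntropyLoss` outputs raw logits, where `torch.softmax` would be used instead (and `torch.sigmoid` for `nn.BCEWithLogitsLoss`). A quick sketch of the first two cases:

```python
import torch

torch.manual_seed(0)
logits = torch.randn(2, 4)  # raw model outputs

# CrossEntropyLoss-trained model: apply softmax at inference for probabilities
probs_from_logits = torch.softmax(logits, dim=1)

# LogSoftmax + NLLLoss-trained model: exp undoes the log
logps = torch.log_softmax(logits, dim=1)
probs_from_logps = torch.exp(logps)

print(torch.allclose(probs_from_logits, probs_from_logps))  # True
```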