Simply because without a sigmoid activation your model outputs raw logits, which are not guaranteed to lie between 0 and 1.
As the name implies, BCEWithLogitsLoss
computes binary cross-entropy directly from the raw logits, while BCELoss
expects probabilities in [0, 1] as mentioned in the docs (BCELoss — PyTorch 2.1 documentation).
See past discussion here: BCELoss vs BCEWithLogitsLoss
So there are two options:
model(input) → logits → BCEWithLogitsLoss → loss
model(input) → logits → torch.sigmoid → BCELoss → loss
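The two options above can be sketched as follows; a minimal example with made-up logits and targets, showing that both paths produce the same loss value:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
logits = torch.randn(4, 1)                        # raw, unbounded model outputs
targets = torch.tensor([[1.], [0.], [1.], [0.]])  # binary labels as floats

# Option 1: feed raw logits directly
loss_a = nn.BCEWithLogitsLoss()(logits, targets)

# Option 2: squash logits to probabilities first, then use BCELoss
probs = torch.sigmoid(logits)
loss_b = nn.BCELoss()(probs, targets)

print(torch.allclose(loss_a, loss_b, atol=1e-6))
```

Note that option 1 is generally preferred because BCEWithLogitsLoss applies the log-sum-exp trick internally, making it more numerically stable for large-magnitude logits.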
I would recommend using the same steps during both training and testing to avoid discrepancies.