In theory, these two losses should have the same value, since both are binary classification losses. Why are the loss values actually different, with one being 2.01 and the other 0.77?

BCEWithLogitsLoss expects a single (real) number per sample
that indicates the “strength” of that sample being in the “1”
state (the “yes” state, if you will).

To recover the loss you get with CrossEntropyLoss you need to
pass in the difference of your state-1 and state-0 strengths.

This code performs the calculation I think you want:
(For simplicity, I’ve removed two of your dimensions; the labels
are now a vector of five samples, with labels.shape = [5].)
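
A minimal sketch of that comparison, assuming random logits and made-up
labels (the tensor names, the seed, and the label values are all
illustrative, not taken from the original post):

```python
import torch

torch.manual_seed(0)

# Hypothetical data: five samples, two class "strengths" (logits) each.
logits = torch.randn(5, 2)
labels = torch.tensor([0, 1, 1, 0, 1])  # labels.shape = [5]

# CrossEntropyLoss takes both strengths and integer class labels.
ce = torch.nn.CrossEntropyLoss()(logits, labels)

# BCEWithLogitsLoss takes one number per sample: the state-1 strength
# minus the state-0 strength, together with floating-point labels.
diff = logits[:, 1] - logits[:, 0]
bce = torch.nn.BCEWithLogitsLoss()(diff, labels.float())

print(ce.item(), bce.item())  # the two values should agree
```

The two losses agree because a two-class softmax reduces to a sigmoid of
the difference of the two strengths.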

Classic cross-entropy loss measures the mismatch between
two (discrete) probability distributions. So, for the binary case,
you compare (Q(“no” state), Q(“yes” state)) with (P(“no” state),
P(“yes” state)), where P(“no” state) is the actual (“ground
truth”) probability that your sample is in the “no” state, while
Q(“no” state) is your model’s prediction of this probability.

(As probabilities, they are all between 0 and 1, and P(“no”) +
P(“yes”) = 1, and similarly for the Qs.)

PyTorch’s CrossEntropyLoss has a built-in Softmax that converts
your model’s predicted “strengths” (relative log-odds-ratios)
into probabilities that sum to one. It also one-hots your labels
so that (in the binary case) label = 1 turns into P(“no”) = 0,
and P(“yes”) = 1. It then calculates the cross-entropy of these
two probability distributions.
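
The three steps just described can be spelled out by hand for a single
sample. This is only a plain-Python sketch of the arithmetic, not
PyTorch's actual implementation; the helper names and the example
strengths are made up, and CrossEntropyLoss additionally averages over
the batch by default.

```python
import math

def softmax(strengths):
    # Convert strengths into probabilities that sum to one.
    m = max(strengths)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in strengths]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(label, num_classes=2):
    # label = 1 becomes P("no") = 0, P("yes") = 1.
    return [1.0 if i == label else 0.0 for i in range(num_classes)]

def cross_entropy(p, q):
    # H(P, Q) = -sum_i P_i * log(Q_i); terms with P_i = 0 contribute nothing.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

strengths = [0.3, 1.2]  # hypothetical ("no", "yes") strengths
label = 1

q = softmax(strengths)
p = one_hot(label)
loss = cross_entropy(p, q)
print(q, p, loss)
```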

BCELoss calculates this same cross-entropy, but it knows that
it’s the binary case, so you only give it one of the two
probabilities, Q(“yes”), and you can understand the 0 and 1
labels as simply being the values of P(“yes”).

This is illustrated by further running the following code:
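
A sketch of such a check, again with made-up logits and labels (every
name and value here is an assumption for illustration): Q(“yes”) is
taken from an explicit softmax and passed to BCELoss.

```python
import torch

torch.manual_seed(0)

# Hypothetical data: five samples, two strengths (logits) per sample.
logits = torch.randn(5, 2)
labels = torch.tensor([0, 1, 1, 0, 1])

# Turn the strengths into probabilities and keep only Q("yes").
probs = torch.softmax(logits, dim=1)
q_yes = probs[:, 1]

# BCELoss takes probabilities, with the 0/1 labels acting as P("yes").
bce = torch.nn.BCELoss()(q_yes, labels.float())

# For comparison: CrossEntropyLoss on the raw strengths.
ce = torch.nn.CrossEntropyLoss()(logits, labels)

print(bce.item(), ce.item())  # the two values should agree
```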

Just as CrossEntropyLoss has a built-in Softmax (to convert
“strengths” to probabilities), BCEWithLogitsLoss has a built-in
logistic function (Sigmoid) to convert the “strength” of the “yes”
state into the probability Q(“yes”). More precisely, the “strength”
is the log-odds-ratio of the “yes” state, also called the “logit”.
That is, BCEWithLogitsLoss expects logit(Q(“yes”)) as its input,
and the built-in Sigmoid converts it back to Q(“yes”).
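
The logit/Sigmoid round trip can be checked with a few lines of plain
Python. The helper functions and the value 0.8 are illustrative
assumptions, not part of PyTorch:

```python
import math

def sigmoid(x):
    # Logistic function: maps a logit to a probability.
    return 1.0 / (1.0 + math.exp(-x))

def logit(q):
    # Log-odds-ratio: the inverse of the logistic function.
    return math.log(q / (1.0 - q))

q_yes = 0.8                    # hypothetical Q("yes")
strength = logit(q_yes)        # log(0.8 / 0.2) = log(4)
recovered = sigmoid(strength)  # back to Q("yes")
print(strength, recovered)
```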