As the title says, I must use the result of softmax, and then I want to apply a loss to it.

I found that NLLLoss must come after log_softmax. If I just compute the log of the softmax result, is that right?

As for nn.CrossEntropyLoss(), there can't be a softmax before it.

Could you please tell me which loss I should choose?

`nn.CrossEntropyLoss` combines `log_softmax` and `NLLLoss`, which means you should not apply `softmax` at the end of your network output. So you are not required to apply softmax, since the criterion takes care of it.
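For example, here is a minimal sketch (with random logits and made-up labels, just for illustration) showing that the two formulations give the same loss:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)            # raw network outputs: 4 samples, 3 classes
target = torch.tensor([0, 2, 1, 2])   # class labels

loss_ce = nn.CrossEntropyLoss()(logits, target)

# the same thing, done in two explicit steps
loss_nll = nn.NLLLoss()(F.log_softmax(logits, dim=1), target)

print(torch.allclose(loss_ce, loss_nll))  # True
```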

If you want to use `softmax` at the end, then you should apply log after that (as you mentioned above) and use `NLLLoss` as the criterion.
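As a rough sketch of that variant (again with random logits and made-up labels):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

logits = torch.randn(4, 3)
target = torch.tensor([0, 2, 1, 2])

# softmax at the end, then an explicit log, then NLLLoss
# (mathematically equivalent, but see the numerical-stability
# caveat discussed further down the thread)
probs = F.softmax(logits, dim=1)
loss = nn.NLLLoss()(torch.log(probs), target)
```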

If I do that, wouldn’t back propagation be a problem?

Doing what will cause a problem during backprop?

I'm not sure whether there will be a problem during backpropagation if I separately use the softmax and log functions instead of log_softmax.

I don’t think it will cause any problem. It’s still the same as using `log_softmax`. Maybe you can test your custom function just to make sure it is consistent with `log_softmax`.
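One quick way to run that consistency check (a sketch using `torch.allclose` on random inputs):

```python
import torch
import torch.nn.functional as F

x = torch.randn(8, 5)

manual = torch.log(F.softmax(x, dim=1))   # separate softmax + log
builtin = F.log_softmax(x, dim=1)         # fused version

print(torch.allclose(manual, builtin, atol=1e-6))  # True for well-behaved inputs
```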

Hi Raghul and Chunchun!

Just to clarify:

`log(softmax())` is *mathematically* the same as `log_softmax()`, but they differ numerically. `softmax()` calculates exponentials that can “blow numbers up.” The `log()` then undoes this, but the damage can already be done. So `log(softmax())` can be numerically unstable, leading to reduced precision and `nan`s, and can cause problems.

`log_softmax()` (largely) avoids this by reorganizing the calculation so that the intermediate blow-up doesn’t occur. (That’s why pytorch (and other packages) include it as a separate function.)
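You can see the instability with a quick example (hand-picked values, just for illustration): for well-separated logits, the small probabilities underflow to zero in `softmax()`, and the subsequent `log()` turns them into `-inf`:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([[0.0, 200.0]])  # well-separated logits

print(torch.log(F.softmax(x, dim=1)))  # tensor([[-inf, 0.]]) -- exp(-200) underflows to 0
print(F.log_softmax(x, dim=1))         # tensor([[-200., 0.]]) -- stable
```

If class 0 happened to be the target here, `nll_loss()` on the first version would return `inf`, and the backward pass would produce `nan`s.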

There is usually no reason to use `softmax()`. Just feed the last linear layer of your network (that you would have fed into `softmax()`) into `cross_entropy()` as your loss function (or use `log_softmax()` followed by `nll_loss()`).
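Concretely, the recommended setup looks something like this sketch (a made-up two-layer network, just for illustration):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 3),   # last linear layer -- no softmax after it
)

x = torch.randn(4, 10)
target = torch.tensor([0, 2, 1, 2])

loss = F.cross_entropy(model(x), target)  # takes the raw logits directly
loss.backward()
```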

If somebody *forces* you to use `softmax()`, then you’re stuck, and have to deal with the potential numerical instability of `softmax()` followed by `log()`…

Good luck!

K. Frank

Thanks a lot for your answer.

That’s very clear, but I must use a layer which can supply probabilities.

Maybe I can use sigmoid+BCELoss?

Hello Chunchun!

In general, there is no particular need to use probabilities *to feed
into your loss function.*

If your use case requires probabilities for some other reason, perhaps you could explain why you need them and what you need to use them for.

For training, you should use (based on what you’ve said so far) a linear layer that outputs numbers from `-inf` to `+inf` (that are to be understood as *logits*) fed into `cross_entropy()` as your loss function. This will all be part of “autograd” and you will back-propagate through it.

Then, if you need actual probabilities for some other reason, take the outputs of your linear layer and, using `with torch.no_grad():` so you don’t affect your gradient calculation, run them through `softmax()` to convert the logits to the probabilities you want.
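Putting those two paragraphs together, a sketch of the pattern might look like this (the model, data, and shapes are made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 3)          # last (here, only) linear layer -> logits
x = torch.randn(4, 10)
target = torch.tensor([0, 2, 1, 2])

# training: logits go straight into cross_entropy; autograd handles the rest
loss = F.cross_entropy(model(x), target)
loss.backward()

# elsewhere: actual probabilities, outside of the gradient calculation
with torch.no_grad():
    probs = F.softmax(model(x), dim=1)   # rows sum to 1
```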

`BCELoss` (binary cross-entropy) is, *in essence,* the special two-class case of the multi-class `cross_entropy()` loss.
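You can check this correspondence numerically: a single binary logit `z` behaves like the two-class logits `[0, z]`. A sketch with hand-picked values:

```python
import torch
import torch.nn.functional as F

z = torch.tensor([0.3, -1.2, 2.5])    # one logit per sample
y = torch.tensor([1.0, 0.0, 1.0])     # binary targets

bce = F.binary_cross_entropy_with_logits(z, y)

# the same samples, written as two-class logits [0, z] with integer labels
ce = F.cross_entropy(torch.stack([torch.zeros_like(z), z], dim=1), y.long())

print(torch.allclose(bce, ce))  # True (up to float rounding)
```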

`sigmoid()` --> `BCELoss` has the same numerical problems as `softmax()` --> `log()` --> `nll_loss()`. If you are performing a binary (two-class) classification problem, you will want to feed the (single) output of your last linear layer into `binary_cross_entropy_with_logits()` (`BCEWithLogitsLoss`). (This is the binary analog of `cross_entropy()` (`CrossEntropyLoss`).)

And again, if you *need* the actual probability (which you don’t for training), you would run the output of your last linear layer through `sigmoid()` (under `with torch.no_grad():`) to get the probability.
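A sketch of the binary version of the same pattern (shapes and data made up):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(10, 1)               # single output -> one logit per sample
x = torch.randn(4, 10)
y = torch.tensor([1.0, 0.0, 1.0, 0.0])

# training: feed the raw logit into the "with logits" loss
loss = F.binary_cross_entropy_with_logits(model(x).squeeze(1), y)
loss.backward()

# only if you need the actual probability:
with torch.no_grad():
    prob = torch.sigmoid(model(x).squeeze(1))
```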

Good luck!

K. Frank

Thank you very much!