# Calculate loss between a one-hot encoded target and class probabilities

```
print(pred)
>> tensor([[0.05369152, 0.09913312, 0.5724052 , 0.13255161, 0.041731  , 0.10048745],
        [0.13759595, 0.13779168, 0.11650976, 0.27918294, 0.1462693 , 0.18265037],
        [0.07497178, 0.5721384 , 0.02145172, 0.10340305, 0.17965348, 0.04838147]])
print(target)
>> tensor([[0, 0, 0, 1, 0, 0],
        [0, 1, 0, 0, 0, 0],
        [0, 1, 0, 0, 0, 0]])
```

Namely, I have 6 classes: 0 to 5.
I want to calculate the loss between them. What type of loss function should I choose in this case (e.g., `nn.CrossEntropyLoss`), and how?

Hi Driss!

First let me suggest what you ought to do and then answer your specific
question:

You want `pred` to consist of (unnormalized) log-probabilities. To achieve
this you should have the final layer of your network be a `Linear` with
`out_features = 6`, not followed by `Softmax` (nor any other "activation"
layer). This is for improved numerical stability (compared with having `pred` be
probabilities, as you appear to have), but also because `CrossEntropyLoss`
requires `pred` to be log-probabilities.

You also want `target` to be integer class labels (with shape `[nBatch]` and
no class dimension). (You can use a one-hot encoded `target` but there is no
real point to doing so, and it is somewhat less efficient.) You can convert
`target` to integer class labels with `target.argmax (dim = 1)`.
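
For example, here is a minimal sketch of this recommended setup (the two-layer network, the input size of 10 features, and the random input batch are made up for illustration):

```
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical network whose final layer is a Linear with out_features = 6,
# not followed by Softmax (or any other "activation" layer)
model = nn.Sequential(
    nn.Linear(10, 32),
    nn.ReLU(),
    nn.Linear(32, 6),
)

x = torch.randn(3, 10)                  # made-up batch of 3 samples
one_hot_target = torch.tensor([
    [0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
])

pred = model(x)                         # unnormalized log-probabilities, shape [3, 6]
target = one_hot_target.argmax(dim=1)   # integer class labels, shape [3]

loss = nn.CrossEntropyLoss()(pred, target)
print(loss)
```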

Now to answer your question in the specific context you give (but doing things
this way will be sub-optimal):

You must convert `pred` to log-probabilities by passing it through `log()`.

You must also convert one-hot-encoded `target` from integers to floating-point
numbers, e.g., with `float()`. This is because `CrossEntropyLoss` works in two
modes: One is with integer class labels (of shape `[nBatch]`), and the other is
with (floating-point) probabilistic "soft" labels (of shape `[nBatch, nClass]`).
(One-hot-encoded labels can be understood to be a special case of probabilistic
labels where the probabilities happen to be `0.0` or `1.0`.) `CrossEntropyLoss`
decides which mode to use based, in part, on whether `target` is integer or
floating-point. If you pass in an integer `target`, `CrossEntropyLoss` will try
to interpret it as integer class labels rather than as probabilistic "soft" labels
(of which one-hot-encoded labels are a special case).
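
As a concrete illustration, here is a minimal sketch that applies this to the `pred` and `target` you posted (assuming a pytorch version recent enough, 1.10 or later, that `CrossEntropyLoss` accepts floating-point "soft" targets):

```
import torch
import torch.nn as nn

# the probabilities and one-hot labels posted above
pred = torch.tensor([
    [0.05369152, 0.09913312, 0.5724052,  0.13255161, 0.041731,   0.10048745],
    [0.13759595, 0.13779168, 0.11650976, 0.27918294, 0.1462693,  0.18265037],
    [0.07497178, 0.5721384,  0.02145172, 0.10340305, 0.17965348, 0.04838147],
])
target = torch.tensor([
    [0, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 0],
    [0, 1, 0, 0, 0, 0],
])

# convert pred to log-probabilities and target to floating-point "soft" labels
loss = nn.CrossEntropyLoss()(pred.log(), target.float())
print(loss)
```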

Best.

K. Frank

Please, can you explain to me with an example, applying what you said to these two variables "target" and "pred"? Because my real source code is very large… I tried to apply what you said all day, but without success.

Hi Driss!

I would suggest that you set up a toy example, using as example data the `pred`
and `target` you posted above, and attempt to pass them to `CrossEntropyLoss`.

If you still have issues, post a small, fully-self-contained, runnable script that
shows what youâ€™ve tried, together with its output. Let us know what you think
isnâ€™t working and ask any specific questions you might have.

Best.

K. Frank

Can you provide some additional details on this, ideally in a way that a software engineer/non-mathematician can understand? Why does one want "numerical stability"? What effect does that have on training/inference? Since we are working with floating-point numbers, it doesn't seem like there should be any obvious advantage to using numbers inside a particular range vs. some other range.

Hi Micah!

Even though floating-point numbers can represent a large range, that range
is finite, so floating-point numbers can "underflow" to zero and "overflow" to `inf`.

The problem is that when you have log-probabilities in a very reasonable range,
the process of using `exp()` to convert them to probabilities greatly expands that
range, making underflow and overflow much more likely.

Consider:

```
>>> import torch
>>> print (torch.__version__)
2.1.0
>>>
>>> # unnormalized log-probabilities in a very reasonable range
>>> log_prob = torch.tensor ([-150.0, -120.0, -100.0, -50.0, 0.0, 50.0, 100.0, 120.0, 150.0])
>>>
>>> # but they "saturate" to 0.0 and 1.0 when converted to probabilities
>>> log_prob.softmax (0)
tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.7835e-44,
        1.9287e-22, 9.3576e-14, 1.0000e+00])
>>>
>>> # this is because softmax() uses exp() internally which underflows to 0.0 and overflows to inf
>>> log_prob.exp()
tensor([0.0000e+00, 0.0000e+00, 3.7835e-44, 1.9287e-22, 1.0000e+00, 5.1847e+21,
               inf,        inf,        inf])
>>>
>>> # we can use pytorch's log_softmax() to convert unnormalized log-probabilities to normalized log-probabilities
>>> log_prob.log_softmax (0)
tensor([-300., -270., -250., -200., -150., -100.,  -50.,  -30.,    0.])
```

If you're training and you try to compute `softmax()` yourself, you will, if you're
not careful, get `inf`s and `nan`s that will pollute your parameters with `inf`s and
`nan`s. Even if you use pytorch's `softmax()`, when a probability saturates at
`1.0` its gradient will become zero, and training will not progress. When a
probability saturates at `0.0`, the `log()` inside of the cross-entropy function
will give you `inf` for your loss and your training will break down.

Avoiding this is what I meant by "numerical stability."

Because cross-entropy uses the logs of the probabilities in its formula, it never
needs to compute the actual probabilities. Instead, it takes unnormalized
log-probabilities as its input and converts them to normalized log-probabilities
with `log_softmax()`. By leaving the predicted probabilities in "log-space," so to
speak, `CrossEntropyLoss` essentially eliminates the possibility of zero gradients
and `inf`s in this part of the computation.
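
Here is a small sketch of the difference (the extreme, but still representable, logits are made up): `CrossEntropyLoss`, working in log-space, stays finite, while converting to probabilities first and then taking their log blows up:

```
import torch
import torch.nn as nn

logits = torch.tensor([[-150.0, 0.0, 150.0]])   # unnormalized log-probabilities
target = torch.tensor([0])                      # a badly mispredicted class, for emphasis

# CrossEntropyLoss uses log_softmax() internally and stays finite
print(nn.CrossEntropyLoss()(logits, target))    # tensor(300.)

# going through probabilities first underflows to 0.0, and log (0.0) is -inf
probs = logits.softmax(dim=1)
print(probs)                                    # tensor([[0., 0., 1.]])
print(-probs.log()[0, target])                  # tensor([inf])
```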

Best.

K. Frank


Given that softmax will lead to `0`/`inf`, what is the best (safest) way to convert my model's output into normal probabilities that I can use as output during model execution (not training)? Unitless numbers like `-300` and `72` aren't particularly useful in turning the model output into human-understandable probabilities, and if I run it through softmax it sounds like I will run into the problem you described, and log_softmax will just give me more unitless numbers that don't appear to have a meaningful range on them.

My expectation was that I would be able to have my model spit out numbers in the range [0,1] where the sum of the probabilities is 1, which is very easy for my feeble human brain to interpret meaningfully without any additional context.

Hi Micah!

`softmax()` does not lead to `inf`; it produces probabilities that range from
`0.0` to `1.0`. (`exp()` can lead to `inf`.)

Pass your modelâ€™s output through `softmax()` to get probabilities (assuming
that your model has been trained to predict (unnormalized) log-probabilities,
typically by passing the output of your model into `CrossEntropyLoss` as the
loss criterion you use when training).

You will get perfectly sound probabilities. Just be aware that you can lose a
little bit of information when converting log-probabilities into probabilities in
that two probabilities that are different and both very close to zero (if you were
working with infinite precision) could both underflow to the same value of zero
even though they can be seen to be non-zero and different in log-space.
Similarly, a probability that is close to, but not equal to one could (when using
finite precision) saturate to one when converted from log-space to regular
probability space.

For many purposes this modest loss of information won't matter.

Using `softmax()` to convert your model's predictions to probabilities won't
break anything; you'll just potentially lose a little bit of information, as described
above.

Train with (unnormalized) log-probabilities as your predictions, but if you then
want to view those predictions as probabilities, pass them through `softmax()`.
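
A minimal sketch of this inference-time usage (the logits here are made up, echoing the "unitless" numbers mentioned above):

```
import torch

# raw model output: unnormalized log-probabilities
logits = torch.tensor([[-300.0, -270.0, -250.0, -200.0, -150.0, 72.0]])

# softmax() turns them into probabilities in [0, 1] that sum to 1
probs = logits.softmax(dim=1)
print(probs)
print(probs.sum())   # tensor(1.)
```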

Best.

K. Frank
