Calculate loss between a one-hot encoded target and class probabilities

Please, I have two tensors:

print(pred)
>> tensor([[0.05369152, 0.09913312, 0.5724052,  0.13255161, 0.041731,   0.10048745],
           [0.13759595, 0.13779168, 0.11650976, 0.27918294, 0.1462693,  0.18265037],
           [0.07497178, 0.5721384,  0.02145172, 0.10340305, 0.17965348, 0.04838147]])
print(target)
>> tensor([[0, 0, 0, 1, 0, 0],
           [0, 1, 0, 0, 0, 0],
           [0, 1, 0, 0, 0, 0]])

Namely, I have 6 classes: 0 to 5.
I want to calculate the loss between them. What type of loss function should I choose in this case (e.g., nn.CrossEntropyLoss), and how?
Thanks in advance.

Hi Driss!

First let me suggest what you ought to do and then answer your specific
question:

You want pred to consist of (unnormalized) log-probabilities. To achieve
this you should have the final layer of your network be a Linear with
out_features = 6, not followed by Softmax (nor any other “activation”
layer). This is for improved numerical stability (compared with having pred be
probabilities, as you appear to have now), but also because CrossEntropyLoss
expects pred to be (unnormalized) log-probabilities.
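
A minimal sketch of such a network head (the input and hidden sizes here are
just placeholders for illustration, not anything from your actual model):

import torch
import torch.nn as nn

class Net (nn.Module):
    def __init__ (self):
        super().__init__()
        self.hidden = nn.Linear (16, 32)   # placeholder sizes
        self.head = nn.Linear (32, 6)      # out_features = 6, one per class

    def forward (self, x):
        x = torch.relu (self.hidden (x))
        return self.head (x)               # no Softmax -- return unnormalized log-probabilities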

You also want target to be integer class labels (with shape [nBatch] and
no class dimension). (You can use a one-hot encoded target but there is no
real point to doing so, and it is somewhat less efficient.) You can convert
target to integer class labels with target.argmax (dim = 1).
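
For example (a minimal sketch; the randn() call here just stands in for the raw
output of your final Linear layer):

import torch

target_one_hot = torch.tensor ([[0, 0, 0, 1, 0, 0],
                                [0, 1, 0, 0, 0, 0],
                                [0, 1, 0, 0, 0, 0]])
target = target_one_hot.argmax (dim = 1)   # integer class labels
print (target)                             # tensor([3, 1, 1])

pred = torch.randn (3, 6)                  # stand-in for the raw output of the final Linear
loss = torch.nn.CrossEntropyLoss() (pred, target)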

Now to answer your question in the specific context you give (but doing things
this way will be sub-optimal):

You must convert pred to log-probabilities by passing it through log().

You must also convert one-hot-encoded target from integers to floating-point
numbers, e.g., with float(). This is because CrossEntropyLoss works in two
modes: One is with integer class labels (of shape [nBatch]), and the other is
with (floating-point) probabilistic “soft” labels (of shape [nBatch, nClass]).
(One-hot-encoded labels can be understood to be a special case of probabilistic
labels where the probabilities happen to be 0.0 or 1.0.) CrossEntropyLoss
decides which mode to use based, in part, on whether target is integer or
floating-point. If you pass in an integer target, CrossEntropyLoss will try
to interpret it as integer class labels rather than as probabilistic “soft” labels
(of which one-hot-encoded labels are a special case).
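
Concretely, with the two tensors you posted, this sub-optimal approach would
look something like the following sketch (just to make the two conversions
explicit):

import torch

pred = torch.tensor ([[0.05369152, 0.09913312, 0.5724052,  0.13255161, 0.041731,   0.10048745],
                      [0.13759595, 0.13779168, 0.11650976, 0.27918294, 0.1462693,  0.18265037],
                      [0.07497178, 0.5721384,  0.02145172, 0.10340305, 0.17965348, 0.04838147]])
target = torch.tensor ([[0, 0, 0, 1, 0, 0],
                        [0, 1, 0, 0, 0, 0],
                        [0, 1, 0, 0, 0, 0]])

# log() converts probabilities to log-probabilities; float() makes target a "soft" label
loss = torch.nn.CrossEntropyLoss() (pred.log(), target.float())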

Best.

K. Frank

Please, can you explain to me with an example, applying what you said to these two variables “target” and “pred”? My real source code is very large… I tried to apply what you said all day, but without success.

Hi Driss!

I would suggest that you set up a toy example, using as example data the pred
and target you posted above, and attempt to pass them to CrossEntropyLoss
following the suggestions I made.

If you still have issues, post a small, fully-self-contained, runnable script that
shows what you’ve tried, together with its output. Let us know what you think
isn’t working and ask any specific questions you might have.

Best.

K. Frank

Can you provide some additional details on this, ideally in a way that a software engineer/non-mathematician can understand? Why does one want “numerical stability”? What effect does that have on training/inference? Since we are working with floating-point numbers, there doesn’t seem to be any obvious advantage to using numbers inside a particular range vs. some other range.

Hi Micah!

Even though floating-point numbers can represent a large range, that range
is finite, so floating-point numbers can “underflow” to zero and “overflow” to inf.

The problem is that when you have log-probabilities in a very reasonable range,
the process of using exp() to convert them to probabilities greatly expands that
range, making underflow and overflow much more likely.

Consider:

>>> import torch
>>> print (torch.__version__)
2.1.0
>>>
>>> # unnormalized log-probabilities in a very reasonable range
>>> log_prob = torch.tensor ([-150.0, -120.0, -100.0, -50.0, 0.0, 50.0, 100.0, 120.0, 150.0])
>>>
>>> # but they "saturate" to 0.0 and 1.0 when converted to probabilities
>>> log_prob.softmax (0)
tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.7835e-44,
        1.9287e-22, 9.3576e-14, 1.0000e+00])
>>>
>>> # this is because softmax() uses exp() internally which underflows to 0.0 and overflows to inf
>>> log_prob.exp()
tensor([0.0000e+00, 0.0000e+00, 3.7835e-44, 1.9287e-22, 1.0000e+00, 5.1847e+21,
               inf,        inf,        inf])
>>>
>>> # we can use pytorch's log_softmax() to convert unnormalized log-probabilities to normalized log-probabilities
>>> log_prob.log_softmax (0)
tensor([-300., -270., -250., -200., -150., -100.,  -50.,  -30.,    0.])

If you’re training and you try to compute softmax() yourself, then, if you’re not
careful, you will get infs and nans that will pollute your parameters with infs and
nans. Even if you use pytorch’s softmax(), when a probability saturates at
1.0 its gradient will become zero, and training will not progress. When a
probability saturates at 0.0, the log() inside of the cross-entropy function
will give you inf for your loss and your training will break down.
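
A tiny illustration of both failure modes (the specific values here are arbitrary,
just chosen to trigger the saturation):

import torch

logits = torch.tensor ([-150.0, 0.0, 150.0])

probs = logits.softmax (0)       # saturates to tensor([0., 0., 1.])
print (probs.log())              # tensor([-inf, -inf, 0.]) -- an inf loss would follow
print (logits.log_softmax (0))   # tensor([-300., -150., 0.]) -- stays finite in log-space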

Avoiding this is what I meant by “numerical stability.”

Because cross-entropy uses the logs of the probabilities in its formula, it never
needs to compute the actual probabilities. Instead, it takes unnormalized
log-probabilities as its input and converts them to normalized log-probabilities
with log_softmax(). By leaving the predicted probabilities in “log-space,” so to
speak, CrossEntropyLoss essentially eliminates the possibility of zero gradients
and infs in this part of the computation.
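
In code, that equivalence looks something like this (the values are arbitrary;
cross_entropy() and nll_loss() are the functional forms of the corresponding
loss modules):

import torch
import torch.nn.functional as F

pred = torch.randn (3, 6)            # unnormalized log-probabilities (raw Linear output)
target = torch.tensor ([3, 1, 1])    # integer class labels

loss_a = F.cross_entropy (pred, target)
loss_b = F.nll_loss (F.log_softmax (pred, dim = 1), target)
print (torch.allclose (loss_a, loss_b))   # True -- cross_entropy stays in log-space throughout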

Best.

K. Frank


Very helpful, thank you K!

Given that softmax will lead to 0/inf, what is the best (safest) way to convert my model’s output into normal probabilities that I can use as output during model execution (not training)? Unitless numbers like -300 and 72 aren’t particularly useful for turning the model output into human-understandable probabilities; if I run it through softmax it sounds like I will run into the problem you described, and log_softmax will just give me more unitless numbers that don’t appear to have a meaningful range.

My expectation was that I would be able to have my model spit out numbers in the range [0, 1] where the probabilities sum to 1, which is very easy for my feeble human brain to interpret meaningfully without any additional context.

Hi Micah!

softmax() does not lead to inf – it produces probabilities that range from
0.0 to 1.0. (exp() can lead to inf).

Pass your model’s output through softmax() to get probabilities (assuming
that your model has been trained to predict (unnormalized) log-probabilities,
typically by passing the output of your model into CrossEntropyLoss as the
loss criterion you use when training).
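
For example (a minimal inference-time sketch; the Linear layer and random batch
here just stand in for your trained model and real data):

import torch

model = torch.nn.Linear (10, 6)       # stand-in for a trained model
input_batch = torch.randn (3, 10)     # stand-in for a batch of real inputs

with torch.no_grad():
    logits = model (input_batch)           # unnormalized log-probabilities
    probs = logits.softmax (dim = 1)       # each row sums to 1.0, entries in [0.0, 1.0]
    predicted = probs.argmax (dim = 1)     # same argmax as the raw logits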

You will get perfectly sound probabilities. Just be aware that you can lose a
little bit of information when converting log-probabilities into probabilities in
that two probabilities that are different and both very close to zero (if you were
working with infinite precision) could both underflow to the same value of zero
even though they can be seen to be non-zero and different in log-space.
Similarly, a probability that is close to, but not equal to one could (when using
finite precision) saturate to one when converted from log-space to regular
probability space.

For many purposes this modest loss of information won’t matter.

Using softmax() to convert your model’s predictions to probabilities won’t
break anything – you’ll just potentially lose a little bit of information, as described
above.

Train with (unnormalized) log-probabilities as your predictions, but if you then
want to view those predictions as probabilities, pass them through softmax().

Best.

K. Frank
