Hi Micah!

Even though floating-point numbers can represent a large range, that range
is finite, so floating-point numbers can “underflow” to zero and “overflow” to `inf`.

The problem is that when you have log-probabilities in a very reasonable range,
the process of using `exp()` to convert them to probabilities greatly expands that
range, making underflow and overflow much more likely.

Consider:

```
>>> import torch
>>> print (torch.__version__)
2.1.0
>>>
>>> # unnormalized log-probabilities in a very reasonable range
>>> log_prob = torch.tensor ([-150.0, -120.0, -100.0, -50.0, 0.0, 50.0, 100.0, 120.0, 150.0])
>>>
>>> # but they "saturate" to 0.0 and 1.0 when converted to probabilities
>>> log_prob.softmax (0)
tensor([0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 0.0000e+00, 3.7835e-44,
        1.9287e-22, 9.3576e-14, 1.0000e+00])
>>>
>>> # this is because softmax() uses exp() internally which underflows to 0.0 and overflows to inf
>>> log_prob.exp()
tensor([0.0000e+00, 0.0000e+00, 3.7835e-44, 1.9287e-22, 1.0000e+00, 5.1847e+21,
        inf, inf, inf])
>>>
>>> # we can use pytorch's log_softmax() to convert unnormalized log-probabilities to normalized log-probabilities
>>> log_prob.log_softmax (0)
tensor([-300., -270., -250., -200., -150., -100., -50., -30., 0.])
```
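
The standard way to make this stable is the “log-sum-exp” trick: subtract the
maximum before exponentiating, so the largest exponent is `0.0` and `exp()` can
no longer overflow. Here is a minimal sketch of that trick (an illustration of
the standard technique, not necessarily pytorch’s exact implementation), using
the same tensor as above:

```
import torch

log_prob = torch.tensor([-150.0, -120.0, -100.0, -50.0, 0.0, 50.0, 100.0, 120.0, 150.0])

# subtracting the max shifts the values so the largest is 0.0,
# after which exp() can no longer overflow
shifted = log_prob - log_prob.max()

# normalize in log-space: log (exp (shifted) / sum (exp (shifted)))
stable = shifted - shifted.exp().sum().log()

# matches pytorch's built-in log_softmax()
print(torch.allclose(stable, log_prob.log_softmax(0)))
```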

If you’re training and you try to compute `softmax()` yourself, and you’re not careful,
you will get `inf`s and `nan`s that will pollute your parameters with `inf`s and
`nan`s. Even if you use pytorch’s `softmax()`, when a probability saturates at
`1.0` its gradient will become zero, and training will not progress. When a
probability saturates at `0.0`, the `log()` inside of the cross-entropy function
will give you `inf` for your loss and your training will break down.
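
To make the `0.0` failure mode concrete, here is a small made-up example (the
logits are just for illustration) where computing `softmax()` and `log()` by hand
produces an `inf` loss, while `cross_entropy()`, which works in log-space, stays
finite:

```
import torch

logits = torch.tensor([[0.0, 200.0]], requires_grad=True)
target = torch.tensor([0])   # the "unlikely" class

# naive route: the probability of class 0 underflows to 0.0, so log() gives inf
naive_loss = -logits.softmax(1)[0, 0].log()
print(naive_loss)            # inf

# cross_entropy() stays in log-space and gives the correct, finite loss
stable_loss = torch.nn.functional.cross_entropy(logits, target)
print(stable_loss)           # 200.0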

Avoiding this is what I meant by “numerical stability.”

Because cross-entropy uses the logs of the probabilities in its formula, it never
needs to compute the actual probabilities. Instead, it takes unnormalized
log-probabilities as its input and converts them to normalized log-probabilities
with `log_softmax()`. By leaving the predicted probabilities in “log-space,” so to
speak, `CrossEntropyLoss` essentially eliminates the possibility of zero gradients
and `inf`s in this part of the computation.
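
For reference, `CrossEntropyLoss` takes raw, unnormalized log-probabilities
(logits) directly, and is equivalent to `log_softmax()` followed by `NLLLoss`
(the specific values here are just illustrative):

```
import torch

logits = torch.tensor([[-50.0, 0.0, 120.0]])   # raw, unnormalized log-probabilities
target = torch.tensor([2])

# CrossEntropyLoss takes the logits directly -- no softmax() needed ...
loss_a = torch.nn.CrossEntropyLoss()(logits, target)

# ... and is equivalent to log_softmax() followed by NLLLoss
loss_b = torch.nn.NLLLoss()(logits.log_softmax(1), target)

print(torch.allclose(loss_a, loss_b))
```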

Best.

K. Frank