Loss outputting nan

Hi,

I’m trying to build a simple NN for a categorical classification problem, but I’m not able to get any value out of the loss. I’m not sure if something is wrong with my layers or if it’s just a syntax issue.

My dataset is a 5518x512 tensor (5518 observations, 512 features per observation) and my labels are a categorical 5518x1 tensor, which I converted into a 5518x5 one-hot encoded tensor (5 classes).

My model is as follows:

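import torch as tt  # alias assumed from the tt.nn usage below

class NN(tt.nn.Module):
    def __init__(self):
        super().__init__()
        # single linear layer: 512 input features -> 5 class scores
        self.dense = tt.nn.Linear(in_features=512, out_features=5)

    def forward(self, x):
        y = self.dense(x)
        return y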

My loss function and optimizer:

criterion = tt.nn.CrossEntropyLoss()
optimizer = tt.optim.SGD(model.parameters(), lr=0.0001)

Now, when I initialize the model and do a forward pass, it works perfectly, but when I calculate the loss, I get nan every time, even in the first iteration.

For example, one forward pass gives me:

>>> yp
tensor([[-0.0309, -0.0312, -0.0166, -0.0349,  0.0427],
        [-0.0257, -0.0310, -0.0115, -0.0375,  0.0446],
        [-0.0281, -0.0321, -0.0115, -0.0370,  0.0461],
        ...,
        [-0.0266, -0.0240, -0.0497, -0.0416,  0.0226],
        [-0.0145, -0.0241, -0.0463, -0.0558,  0.0208],
        [-0.0247, -0.0249, -0.0480, -0.0471,  0.0220]], device='cuda:0', grad_fn=<AddmmBackward0>)

And my training labels are:

>>> y
tensor([[1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        [1., 0., 0., 0., 0.],
        ...,
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1.],
        [0., 0., 0., 0., 1.]], device='cuda:0')

And after calculating the loss (criterion(yp, y)), I always get:

tensor(nan, device='cuda:0', grad_fn=<DivBackward1>)

Any idea what it could be?

Thanks in advance!

PS: I’m still not sure if this is the best way to do a categorical classification (encoding, type of NN, etc.). I’ve found many very different solutions and I’m not sure which one would be best (any advice is more than welcome).

Your model is very simple but should work as a demo.

Your inputs (x) probably need to be normalized. If they span a large range of values, it can cause numerical instability.

Look at min-max normalization or z-score normalization.
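
A minimal sketch of both normalizations, applied per feature (column). The tiny tensor and the 1e-8 epsilon are illustrative assumptions, not values from the original post:

```python
import torch

# Hypothetical data: 3 observations, 2 features on very different scales.
x = torch.tensor([[1.0, 200.0],
                  [2.0, 400.0],
                  [3.0, 600.0]])

# z-score normalization: zero mean and unit standard deviation per feature.
mean = x.mean(dim=0, keepdim=True)
std = x.std(dim=0, keepdim=True)
x_z = (x - mean) / (std + 1e-8)  # epsilon guards against constant features

# min-max normalization: rescale each feature to the range [0, 1].
x_min = x.min(dim=0, keepdim=True).values
x_max = x.max(dim=0, keepdim=True).values
x_mm = (x - x_min) / (x_max - x_min + 1e-8)

print(x_z)
print(x_mm)
```

In practice the statistics would be computed on the training set only and reused for validation and test data.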

Indeed, I didn’t normalize them. I’ll try that in a while; however, all the values are small (basically, all the features for an observation add up to 1).

I understand that unnormalized inputs can cause numerical instability and convergence issues, but why would the loss be nan?

EDIT: This is from the first iteration, before doing anything to the grads; it is literally the example I wrote.

Hi Ghost!

Most likely, you have a nan in your data somewhere. First check
whether yp or y have nans or infs in them, and, if so, work backwards
to find out what causes them.
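
One way to run that check is sketched below. The tensor x is a hypothetical stand-in for the real inputs (the same calls would apply to yp and y), with one row corrupted on purpose:

```python
import torch

# Hypothetical stand-in for the dataset: 6 observations, 4 features.
x = torch.rand(6, 4)
x[2] = float("nan")  # deliberately corrupt one row for the demonstration

# True if any element is nan or +/-inf.
has_bad = not torch.isfinite(x).all().item()

# Indices of the rows that contain at least one non-finite value.
bad_rows = (~torch.isfinite(x)).any(dim=1).nonzero(as_tuple=True)[0]

print(has_bad)   # True
print(bad_rows)  # tensor([2])
```

Knowing which rows are affected usually makes it easier to trace the problem back to the code that produced them.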

Note that the tensors you posted are elided for display purposes. But
using your elided tensors, you can see the CrossEntropyLoss works
just fine with them:

>>> import torch
>>> print (torch.__version__)
2.3.1
>>>
>>> # your elided tensors
>>>
>>> yp = torch.tensor ([[-0.0309, -0.0312, -0.0166, -0.0349,  0.0427],
...                     [-0.0257, -0.0310, -0.0115, -0.0375,  0.0446],
...                     [-0.0281, -0.0321, -0.0115, -0.0370,  0.0461],
...                     [-0.0266, -0.0240, -0.0497, -0.0416,  0.0226],
...                     [-0.0145, -0.0241, -0.0463, -0.0558,  0.0208],
...                     [-0.0247, -0.0249, -0.0480, -0.0471,  0.0220]]
... )
>>>
>>> y  = torch.tensor ([[1., 0., 0., 0., 0.],
...                     [1., 0., 0., 0., 0.],
...                     [1., 0., 0., 0., 0.],
...                     [0., 0., 0., 0., 1.],
...                     [0., 0., 0., 0., 1.],
...                     [0., 0., 0., 0., 1.]]
... )
>>>
>>> torch.nn.CrossEntropyLoss() (yp, y)
tensor(1.5945)

If you still have trouble after searching for a nan somewhere in your
input, please post a super-simplified (no tensor dimensions of size
5518 or 512, please), fully-self-contained, runnable script that illustrates
your issue, together with the output you get when you run it.

First, your neural network consists of a single Linear layer with no
non-linearities. As such, it is just multinomial logistic regression and
can only learn linear decision boundaries, so it won’t learn much unless
your classes are (close to) linearly separable. Try using at least two
Linear layers (perhaps more), separated by non-linear “activations”
such as ReLU or Tanh.
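
A minimal sketch of such a network, assuming the 512 features and 5 classes from the question. The class name SmallMLP and the hidden width of 64 are arbitrary illustrative choices:

```python
import torch

class SmallMLP(torch.nn.Module):
    """Two Linear layers separated by a ReLU (hidden width is an assumption)."""

    def __init__(self, n_features=512, n_hidden=64, n_classes=5):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(n_features, n_hidden),
            torch.nn.ReLU(),
            # Final layer outputs raw scores (logits), one per class.
            torch.nn.Linear(n_hidden, n_classes),
        )

    def forward(self, x):
        return self.net(x)

model = SmallMLP()
logits = model(torch.randn(8, 512))  # batch of 8 random inputs
print(logits.shape)                  # torch.Size([8, 5])
```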

Second, use integer categorical labels (rather than converting them to
one-hot format). Your prediction (yp) will have shape [nBatch, nClass]
and your target (y) will have shape [nBatch] (with no class dimension)
and should be of type LongTensor.

(You can use a floating-point, one-hot target of shape [nBatch, nClass],
but it’s extra work and there’s no point to it.)
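
As a sketch, existing one-hot labels can be converted to integer class labels with argmax(). The two rows below are taken from the tensors posted above; the comparison with the one-hot form assumes a PyTorch version whose CrossEntropyLoss accepts floating-point class-probability targets (the 2.3.1 shown above does):

```python
import torch

yp = torch.tensor([[-0.0309, -0.0312, -0.0166, -0.0349, 0.0427],
                   [-0.0266, -0.0240, -0.0497, -0.0416, 0.0226]])
y_onehot = torch.tensor([[1., 0., 0., 0., 0.],
                         [0., 0., 0., 0., 1.]])

# Integer class labels: shape [nBatch], dtype torch.int64 (a LongTensor).
y_int = y_onehot.argmax(dim=1)

criterion = torch.nn.CrossEntropyLoss()
loss_int = criterion(yp, y_int)        # integer-label target
loss_onehot = criterion(yp, y_onehot)  # one-hot (probability) target

print(y_int)  # tensor([0, 4])
print(torch.allclose(loss_int, loss_onehot))  # True
```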

Best.

K. Frank

Thanks! Indeed, I was testing some operations on the input tensor and was also getting nan, so I was already suspecting nans or some issue with the input type. I’m going to test tomorrow and report back!

Thanks again!

Yep, it was that indeed. I had two rows full of nans in the dataset; I tracked the issue back and it was a problem with the function I was using to generate the data.

Now, about using categorical labels instead of one-hot: in this case, the predicted categories are just the torch.max(yp, dim=1), right? In that case, how do I set that up in my training loop so the loss knows that it must take the max? Do I have to use a softmax layer?

Hi Ghost!

If I understand what you are asking, no.

For training, you want your predictions (that you then feed into your
CrossEntropyLoss loss function) to be unnormalized log-probabilities.
These are what you naturally get out of a final Linear layer with
out_features = num_classes (without any following Softmax).

Again, if I understand what you’re asking, you don’t want your training
loop to “get the max.” Just use the raw output of your final Linear layer.
(Note, CrossEntropyLoss has log_softmax() built into it.)
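
To illustrate the built-in log_softmax(), the sketch below compares cross_entropy() on raw logits with an explicit log_softmax() followed by nll_loss(). The random logits and the target values are made up for the demonstration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 5)            # raw output of a final Linear layer
target = torch.tensor([0, 4, 2, 1])   # integer class labels, shape [nBatch]

# cross_entropy takes the raw logits directly.
loss_ce = F.cross_entropy(logits, target)

# Same computation spelled out: log_softmax followed by negative log-likelihood.
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

print(torch.allclose(loss_ce, loss_manual))  # True
```

Adding a Softmax layer before CrossEntropyLoss would therefore normalize the outputs twice.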

For various evaluation metrics, for example, the accuracy (the fraction
of predictions your model gets right), you want to turn the “probabilistic”
predictions produced by your final Linear layer into “hard,” single-class
predictions. To get these, compute the argmax() of the output of the
final Linear layer. Doing so gives you the integer categorical class labels
predicted by your model (which you would then test for equality with your
single-integer training labels (without any one-hot)).
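
A short sketch of that accuracy computation, using made-up logits and labels for a 3-class problem. Note that torch.max() with a dim argument returns both values and indices, and it is the indices that correspond to argmax():

```python
import torch

# Hypothetical logits for 4 samples and 3 classes, with integer labels.
logits = torch.tensor([[2.0, 0.1, 0.3],
                       [0.2, 0.1, 1.5],
                       [0.9, 1.1, 0.4],
                       [0.3, 0.2, 0.1]])
labels = torch.tensor([0, 2, 0, 0])

# Hard, single-class predictions: index of the largest logit per row.
preds = logits.argmax(dim=1)

# The indices returned by torch.max(..., dim=1) match argmax in this example.
same = torch.equal(preds, torch.max(logits, dim=1).indices)

# Accuracy: fraction of predictions equal to the labels.
accuracy = (preds == labels).float().mean().item()

print(preds)     # tensor([0, 2, 1, 0])
print(same)      # True
print(accuracy)  # 0.75
```

This step is only for evaluation; the loss is still computed from the raw logits.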

Best.

K. Frank

Perfect, so setting the last layer of my forward pass to an output dim of size nclasses and giving that along with the labels (with dim = 1) should be interpreted correctly by the loss, right?