Persistent NaN loss

Hi everyone, I am implementing a bi-directional LSTM to predict race (Asian, Black, Hispanic, White) from first name, last name, and the racial distribution of the person's zip code. I am experiencing NaN loss (which I suspect is exploding gradients), despite trying all the common prescriptions: I have ensured that my data contains no null values and that all values are between 0 and 1, and I have tried gradient clipping and smaller learning rates (down to lr = 1e-19). I feel like something is fundamentally wrong with my model, specifically with the way I feed in the racial distribution: a bi-directional LSTM that uses only the first and last name runs without issues. I was hoping someone could provide some insight.

Here is my model

import torch
from torch import nn, optim

DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")

class FirstLastZctaBiLSTM(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super(FirstLastZctaBiLSTM, self).__init__()

        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(
            input_size, hidden_size, batch_first=True, bidirectional=True
        )
        self.h2o = nn.Linear(hidden_size * 2 + 4, output_size)  # 4 is the number of races
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(
        self,
        name: torch.Tensor,
        pct: torch.Tensor,
        hidden: tuple[torch.Tensor, torch.Tensor],
    ):
        _, (h_n, _) = self.lstm(name, hidden)  # grab the final hidden states
        # concatenate the last hidden state of both directions
        h_n =[-2], h_n[-1]), dim=1)

        # add in the distribution at the last time step;
        # pct is a tensor of size (batch_size, 4)
        combined =, pct), dim=1)

        output = self.h2o(combined)
        output: torch.Tensor = self.softmax(output)

        return output, hidden

    def init_hidden(self, batch_size: int):
        # first dim is 2 because the LSTM is bidirectional (1 layer x 2 directions)
        return (
            torch.zeros(2, batch_size, self.hidden_size, device=DEVICE),
            torch.zeros(2, batch_size, self.hidden_size, device=DEVICE),
        )

Here is my training loop

    loss_function = nn.NLLLoss()
    opt = optim.SGD(model.parameters(), lr=lr)

    for epoch in range(1, n_epochs + 1):
        for batch_no, (name, pct, race) in enumerate(dataloader, start=1):
            hidden = model.init_hidden(batch_size)
            output, _ = model(name, pct, hidden)
            loss: torch.Tensor = loss_function(output, race)

            opt.zero_grad()
            loss.backward()

            # clip after backward() and before step(), so the clipped
            # gradients are the ones actually applied
            if clip_value is not None:
                nn.utils.clip_grad_value_(model.parameters(), clip_value=clip_value)

            opt.step()


Something I have also tried is winsorizing the distribution and normalizing the percentages so that they sum to one, e.g. (.05, .1, .2, .2) → (0.0909, 0.1818, 0.3636, 0.3636), but the loss still goes to NaN.
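In case it matters, the renormalization step looks roughly like this (a minimal sketch; renormalize is just an illustrative helper, and pct is assumed to be a (batch_size, 4) float tensor):

import torch

def renormalize(pct: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # clamp into [0, 1], then rescale each row so it sums to one
    pct = pct.clamp(0.0, 1.0)
    return pct / (pct.sum(dim=1, keepdim=True) + eps)

print(renormalize(torch.tensor([[0.05, 0.10, 0.20, 0.20]])))
# tensor([[0.0909, 0.1818, 0.3636, 0.3636]])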

This is a bit tricky to debug without runnable code. Could you provide a runnable version with synthetic data that reproduces the issue?

Thanks. Here is a small example: example - Google Drive. Please let me know of any issues!

Are you sure that your data contains no null values?

I added this assert in the forward call of your model, and it appears to fire before the loss goes to NaN:

    assert not (torch.isnan(name).any() or torch.isnan(pct).any())

Adding print(torch.isnan(pct).any()) to your dataset's __getitem__ method shows that the NaN values are already present at data loading time.
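Putting the dataset-side check together, it looks roughly like this (a sketch — NameDataset and its storage fields are hypothetical; adapt to however your Dataset actually holds its tensors):

import torch
from torch.utils.data import Dataset

class NameDataset(Dataset):  # hypothetical name/fields, for illustration
    def __init__(self, names, pcts, races):
        self.names, self.pcts, self.races = names, pcts, races

    def __len__(self):
        return len(self.names)

    def __getitem__(self, idx):
        pct = self.pcts[idx]
        if torch.isnan(pct).any():
            print(f"NaN pct at index {idx}")  # pinpoints the offending rows
        return self.names[idx], pct, self.races[idx]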


Thanks a lot. This is embarrassing… I have been learning Polars and checking with pl.col("x").is_null(), which, unlike in Pandas, does not catch NaN (for that you need pl.col("x").is_nan()) :man_facepalming: Thanks for taking the time to answer my question.
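For anyone else who hits this, a minimal demonstration of the difference (the column name is made up):

import polars as pl

df = pl.DataFrame({"x": [0.5, float("nan"), None]})
print(df["x"].is_null().to_list())  # [False, False, True]  -- only catches None
print(df["x"].is_nan().to_list())   # [False, True, None]   -- catches NaN; null stays null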
