Persistent NaN loss

CangyuanLi · June 5, 2023, 8:28pm

Hi everyone, I am implementing a bi-directional LSTM to predict race (Asian, Black, Hispanic, White) from first name, last name, and the racial distribution of the person’s zip code. I am experiencing NaN loss (exploding gradients), despite trying all the common prescriptions. I have ensured that my data contains no null values and that all values are between 0 and 1. I have tried gradient clipping and a smaller learning rate (for example, I have tried lr =.0000000000000000001). I feel like there is something fundamentally wrong with my model, specifically with the way I input the racial distribution. I have implemented a bi-directional LSTM that uses only the person’s first name and last name, and that is running without issues. I was hoping someone could provide some insight.

Here is my model

class FirstLastZctaBiLSTM(nn.Module):
    def __init__(self, input_size: int, hidden_size: int, output_size: int):
        super(FirstLastZctaBiLSTM, self).__init__()

        self.hidden_size = hidden_size
        self.lstm = nn.LSTM(
            input_size, hidden_size, batch_first=True, bidirectional=True
        )
        self.h2o = nn.Linear(hidden_size * 2 + 4, output_size) # 4 is number of races
        self.softmax = nn.LogSoftmax(dim=1)

    def forward(
        self,
        name: torch.Tensor,
        pct: torch.Tensor,
        hidden: tuple[torch.Tensor, torch.Tensor],
    ):
        _, (h_n, _) = self.lstm(name, hidden) # grab the last hidden state
        h_n = torch.cat((h_n[-2], h_n[-1]), dim=1) # concatenate last hidden state of both directions

        # add in distribution at last time step
        # pct is tensor of size (batch_size, 4)
        combined = torch.cat((h_n, pct), dim=1) 

        output = self.h2o(combined)
        output: torch.Tensor = self.softmax(output)

        return output, hidden

    def init_hidden(self, batch_size: int):
        return (
            torch.zeros(2, batch_size, self.hidden_size, device=DEVICE),
            torch.zeros(2, batch_size, self.hidden_size, device=DEVICE),
        )

Here is my training loop

    loss_function = nn.NLLLoss()
    opt = optim.SGD(
        model.parameters(),
        lr=.000001,
        momentum=.99,
        nesterov=True,
        weight_decay=.001,
    )

    for epoch in range(1, n_epochs + 1):
        for batch_no, (name, pct, race) in enumerate(dataloader, start=1):
            model.zero_grad(set_to_none=True)
            hidden = model.init_hidden(batch_size)
            output, _ = model(name, pct, hidden)
            loss: torch.Tensor = loss_function(output, race)
            loss.backward()

            if clip_value is not None:
                nn.utils.clip_grad_value_(model.parameters(), clip_value=1)

            opt.step()

Something I have also tried is winsorizing the distribution and normalizing the percents so that they sum to one, i.e. (.05, .1, .2, .2) → (0.0909, 0.18181, 0.3636, 0.3636), although this still doesn’t work.

eqy · June 5, 2023, 9:17pm

This is a bit tricky to debug without runnable code. Could you provide a runnable version with synthetic data that reproduces the issue?

CangyuanLi · June 5, 2023, 10:12pm

Thanks. Here is a small example: example - Google Drive. Please let me know of any issues!

eqy · June 5, 2023, 11:30pm

Are you sure that your data contains no null values?

I added this assert assert not (torch.isnan(name.data).any() or torch.isnan(pct.data).any()) in the forward call of your model and it appears to fire before the loss goes to NaN.

Adding
print(torch.isnan(pct).any()) to your dataset’s __getitem__ method shows that the NaN value is present at data loading time.

CangyuanLi · June 6, 2023, 1:02am

Thanks a lot. This is embarrassing…I have been learning Polars and checking with pl.col("x").is_null(), which, unlike in Pandas, is different from NaN (you should use pl.col("x").is_nan(). Thanks for taking the time to answer my question.