Hi everyone,

I’ve encountered an issue while training my model with a dataset that occasionally has samples with `None` labels. To handle these cases, I set the loss to 0 whenever the label is `None` by using `reduction="none"` on the loss function. Here’s a simplified version of my approach:

```
import torch
from torch import optim, nn
from torch.utils.data import DataLoader

# Dummy data
x = torch.randn(100, 10)
y = torch.randn(100, 1)

# Set random labels to NaN (stand-in for None labels)
nan_indexes = torch.randperm(100)[:10]
y[nan_indexes] = float("nan")
dl = DataLoader(list(zip(x, y)), batch_size=10)

# Model, loss function, and optimizer
model = nn.Sequential(nn.Linear(10, 10), nn.ReLU(), nn.Linear(10, 1))
criterion = nn.MSELoss(reduction="none")
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 20
for epoch in range(num_epochs):
    for x, y in dl:
        optimizer.zero_grad()
        outputs = model(x)
        loss = criterion(outputs, y)  # per-element loss, NaN where the label is NaN
        # Handle None (NaN) labels: zero their per-element loss
        nan_mask = torch.isnan(y)
        loss[nan_mask] = 0
        # Average over the samples that actually have a label
        loss = loss.sum() / (~nan_mask).sum()
        loss.backward()
        optimizer.step()
    print(loss.item())  # loss of the last batch in this epoch
```

The issue arises after a few steps: the gradient norms become NaN while the loss is 0. Eventually the model’s predictions also become NaN, and training crashes.
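A stripped-down sketch that seems to reproduce the gradient symptom in isolation. The two-element tensors `pred` and `target` are made up for illustration and are not part of my training code. As far as I can tell, zeroing the loss entry also zeroes its incoming gradient, but the backward pass of the squared error still multiplies that zero by `(pred - target)`, which is NaN, and `0 * NaN` is NaN:

```python
import torch
import torch.nn.functional as F

# Hypothetical toy tensors: one valid label, one missing (NaN) label
pred = torch.tensor([1.0, 2.0], requires_grad=True)
target = torch.tensor([0.5, float("nan")])

# Per-element loss: the second entry is NaN
loss = F.mse_loss(pred, target, reduction="none")

# Zero the NaN entry after the loss was computed, as in the loop above
loss[torch.isnan(target)] = 0

total = loss.sum()  # the forward value is finite
total.backward()

print(total.item())  # finite loss
print(pred.grad)     # the entry for the NaN label is NaN
```

So the loss value looks fine while the gradient is already contaminated, which would match what I see once the optimizer applies it.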

Here’s what I’ve tried:

- Verified that only the samples with `None` labels have their loss set to zero.
- Checked that the data inputs do not contain NaN values.

Despite these checks, the problem persists. Has anyone faced a similar issue, or does anyone have suggestions on how to properly handle `None` labels in the dataset without ending up with NaN gradients?

Thanks in advance for your help!

Edit: It seems that if I zero out the NaN labels before calculating the loss, I don’t get NaN gradients. Is this the correct procedure if I want to ignore samples on the fly?
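A minimal sketch of what I mean by zeroing the labels first, assuming missing labels are encoded as NaN. The names `valid` and `y_filled` and the single `nn.Linear` model are placeholders for illustration, not my real setup. The NaN labels are replaced before the loss function sees them, and a mask keeps those samples out of the average:

```python
import torch
from torch import nn

torch.manual_seed(0)

model = nn.Linear(10, 1)  # placeholder model
criterion = nn.MSELoss(reduction="none")

x = torch.randn(8, 10)
y = torch.randn(8, 1)
y[[1, 5]] = float("nan")  # two samples with missing labels

# True where a label exists
valid = ~torch.isnan(y)
# Replace NaN labels with 0 so the loss never sees a NaN
y_filled = torch.where(valid, y, torch.zeros_like(y))

per_sample = criterion(model(x), y_filled)  # finite everywhere

# Masked samples contribute nothing; average over valid samples only.
# clamp(min=1) avoids dividing by zero if a batch has no valid label.
loss = (per_sample * valid).sum() / valid.sum().clamp(min=1)
loss.backward()

grads_finite = all(torch.isfinite(p.grad).all().item() for p in model.parameters())
print(loss.item(), grads_finite)
```

With this ordering the masked positions receive a zero gradient multiplied by a finite value, so nothing turns into NaN. Whether the filler value should be 0 or something else should not matter, since those entries are multiplied by the mask anyway.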