Function 'MseLossBackward' returned nan values in its 0th output

I have been fighting this error for a few hours now.
It happens during the first epoch, after about 5 minutes of training (about 100 batches). ipdb does not stop. I checked the inputs, and they contain no NaNs.

This is the loss. I tried to add ipdb breakpoints to it to catch the error, but they are never triggered.

import ipdb
import torch
import torch.nn as nn

class RMSLELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()
        
    def forward(self, pred, actual):
        #print(pred,actual)
        res = torch.sqrt(self.mse(pred,actual))
        if torch.isnan(pred).sum().item()>0:
            ipdb.set_trace()
        if torch.isnan(res).sum().item()>0:
            ipdb.set_trace()
        return res

criterion = RMSLELoss()

I think that after a few iterations I am getting a float(-inf) somewhere, but I am logging the weights to Weights & Biases and they seem normal.
If I understand correctly, the error happens in the backward pass, so how can I debug that?

It seems that this issue doesn’t happen when running with less data. I ran 10 epochs with 5000 samples, and 20 epochs with 500 samples. Both runs seem OK, and the loss decreases for both train and validation.

The model is a combination of a CNN and categorical features. I normalized both the images and the features.

Below is what I log for the weights and biases of the layers (as sums).

{
                    'epoch': epoch+1,
                    'step': step ,
                    'data0' : data[0].sum(),
                    'data1' : data[1].sum(),
                    
                    'cc1w' : model.conv1.weight.sum().detach().cpu().item(),
                    'cc2w' : model.conv2.weight.sum().detach().cpu().item(),
                    'cc3w' : model.conv3.weight.sum().detach().cpu().item(),
                    'cc4w' : model.conv4.weight.sum().detach().cpu().item(),

                    'cc1b' : model.conv1.bias.sum().detach().cpu().item(),
                    'cc2b' : model.conv2.bias.sum().detach().cpu().item(),
                    'cc3b' : model.conv3.bias.sum().detach().cpu().item(),
                    'cc4b' : model.conv4.bias.sum().detach().cpu().item(),

                    
                    'fc1w' : model.fc1.weight.sum().detach().cpu().item(),
                    'fc1b' : model.fc1.bias.sum().detach().cpu().item(),
                    'fc2w' : model.fc2.weight.sum().detach().cpu().item(),
                    'fc2b' : model.fc2.bias.sum().detach().cpu().item(),
                    'fc3w' : model.fc3.weight.sum().detach().cpu().item(),
                    'fc3b' : model.fc3.bias.sum().detach().cpu().item(),
                    'fc4w' : model.fc4.weight.sum().detach().cpu().item(),
                    'fc4b' : model.fc4.bias.sum().detach().cpu().item(),
                  } 

How do you suggest I continue from here?

Your loss might become exactly zero. The derivative of torch.sqrt at zero is Inf, and multiplying it by the zero gradient of the MSE term in the backward pass gives NaN:

import torch
import torch.nn as nn

x = torch.randn(1, 1, requires_grad=True)
y = torch.zeros(1, 1)
criterion = nn.MSELoss()

loss = torch.sqrt(criterion(x * 0, y))
loss.backward()
print(x.grad)
> tensor([[nan]])

You could add a small eps value to the torch.sqrt calculation to avoid it.
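
As a rough sketch of that suggestion, the loss could look something like the following. The class name SafeRMSELoss and the default eps=1e-8 are my own placeholders, not something from your code, so pick a value that suits the scale of your targets. The last lines rerun the zero-loss case from above to check that the gradient stays finite.

```python
import torch
import torch.nn as nn


class SafeRMSELoss(nn.Module):
    """RMSE with a small eps inside the sqrt (hypothetical name, assumed eps)."""

    def __init__(self, eps=1e-8):
        super().__init__()
        self.mse = nn.MSELoss()
        self.eps = eps  # assumption: tune to the scale of your targets

    def forward(self, pred, actual):
        # Adding eps keeps the sqrt argument strictly positive,
        # so its derivative is finite even when the MSE is exactly zero.
        return torch.sqrt(self.mse(pred, actual) + self.eps)


# Same zero-loss case as in the snippet above
x = torch.randn(1, 1, requires_grad=True)
y = torch.zeros(1, 1)
criterion = SafeRMSELoss()

loss = criterion(x * 0, y)
loss.backward()
grad_is_finite = torch.isfinite(x.grad).all().item()
print(x.grad, grad_is_finite)
```

Note that the eps shifts the loss slightly, so the reported value at a perfect fit is sqrt(eps) rather than 0.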