I have been fighting this error for a few hours now.
It happens during the first epoch, after about 5 minutes of training (about 100 batches). ipdb does not stop. I checked the inputs and they contain no NaNs.
This is the loss. I wrapped it with ipdb checks to catch the error, but they are never triggered.
import ipdb
import torch
import torch.nn as nn

class RMSLELoss(nn.Module):
    def __init__(self):
        super().__init__()
        self.mse = nn.MSELoss()

    def forward(self, pred, actual):
        # print(pred, actual)
        res = torch.sqrt(self.mse(pred, actual))
        # Drop into the debugger if the predictions or the loss contain NaN.
        if torch.isnan(pred).sum().item() > 0:
            ipdb.set_trace()
        if torch.isnan(res).sum().item() > 0:
            ipdb.set_trace()
        return res

criterion = RMSLELoss()
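One possible cause worth ruling out, offered as a sketch rather than a diagnosis: the derivative of sqrt is infinite at zero, so if the MSE of a batch is ever exactly 0, the forward value is a finite 0 (and the isnan checks in the loss never fire) while the backward pass produces NaN gradients. The `SafeRMSELoss` name, the `eps` value, and the helper functions below are assumptions for illustration and are not part of the original code.

```python
import torch
import torch.nn as nn


class SafeRMSELoss(nn.Module):
    """Hypothetical variant of the loss above: sqrt(MSE + eps)."""

    def __init__(self, eps=1e-8):  # eps is an assumed value, tune as needed
        super().__init__()
        self.mse = nn.MSELoss()
        self.eps = eps

    def forward(self, pred, actual):
        # eps keeps the argument of sqrt strictly positive, so the
        # gradient 1 / (2 * sqrt(x)) stays finite even when MSE == 0.
        return torch.sqrt(self.mse(pred, actual) + self.eps)


def plain_rmse(pred, actual):
    # Same computation as the original loss: sqrt of the MSE, no eps.
    return torch.sqrt(nn.functional.mse_loss(pred, actual))


def grad_of(loss_fn, pred_values, actual_values):
    """Return the gradient of loss_fn with respect to the predictions."""
    pred = torch.tensor(pred_values, requires_grad=True)
    actual = torch.tensor(actual_values)
    loss_fn(pred, actual).backward()
    return pred.grad


# Perfect predictions: the forward loss is 0.0, but the plain version
# yields NaN gradients (inf from sqrt times 0 from the MSE).
unsafe_grad = grad_of(plain_rmse, [1.0, 2.0], [1.0, 2.0])
safe_grad = grad_of(SafeRMSELoss(), [1.0, 2.0], [1.0, 2.0])
print("plain has NaN grad:", torch.isnan(unsafe_grad).any().item())
print("eps version finite:", torch.isfinite(safe_grad).all().item())
```

If this turns out to be the cause, it would also be consistent with the loss value and the weights looking normal right up to the failing step.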
I think that after a few iterations I am getting a float('-inf') somewhere, but I am logging the weights to Weights & Biases and they seem normal.
If I understand correctly, the error happens in the backward pass, so how can I debug that?
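For debugging the backward pass specifically, one option is autograd's anomaly mode, which makes `backward()` raise at the first operation that returns NaN and reports the forward call that created it. The sketch below uses hand-made tensors as placeholders (an assumption, not the real model or data) and reproduces a NaN on purpose by making the predictions equal the targets.

```python
import torch
import torch.nn as nn

# Placeholder tensors standing in for a real batch (hypothetical values).
pred = torch.tensor([1.0, 2.0], requires_grad=True)
actual = torch.tensor([1.0, 2.0])

caught = None
# Anomaly mode slows training noticeably, so enable it only while
# debugging. torch.autograd.set_detect_anomaly(True) is the global switch.
with torch.autograd.detect_anomaly():
    loss = torch.sqrt(nn.functional.mse_loss(pred, actual))
    try:
        loss.backward()
    except RuntimeError as err:
        # The message names the backward function that produced NaN.
        caught = err

print(type(caught).__name__)
print(caught)
```

In a real training loop the `try`/`except` could be replaced by `ipdb.set_trace()` inside the `except` branch, so the debugger stops on the batch that triggers the failure.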
This issue does not seem to happen when running with less data. I ran 10 epochs with 5000 samples and 20 epochs with 500 samples. Both seem fine, and the loss decreases for both train and validation.
The model combines a CNN with categorical features. I normalized both the images and the features.
Below are the weight and bias sums I log from the layers.
{
    'epoch': epoch + 1,
    'step': step,
    'data0': data[0].sum(),
    'data1': data[1].sum(),
    'cc1w': model.conv1.weight.sum().detach().cpu().item(),
    'cc2w': model.conv2.weight.sum().detach().cpu().item(),
    'cc3w': model.conv3.weight.sum().detach().cpu().item(),
    'cc4w': model.conv4.weight.sum().detach().cpu().item(),
    'cc1b': model.conv1.bias.sum().detach().cpu().item(),
    'cc2b': model.conv2.bias.sum().detach().cpu().item(),
    'cc3b': model.conv3.bias.sum().detach().cpu().item(),
    'cc4b': model.conv4.bias.sum().detach().cpu().item(),
    'fc1w': model.fc1.weight.sum().detach().cpu().item(),
    'fc1b': model.fc1.bias.sum().detach().cpu().item(),
    'fc2w': model.fc2.weight.sum().detach().cpu().item(),
    'fc2b': model.fc2.bias.sum().detach().cpu().item(),
    'fc3w': model.fc3.weight.sum().detach().cpu().item(),
    'fc3b': model.fc3.bias.sum().detach().cpu().item(),
    'fc4w': model.fc4.weight.sum().detach().cpu().item(),
    'fc4b': model.fc4.bias.sum().detach().cpu().item(),
}
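The logged values above cover weights only, while a backward-pass problem would show up in the gradients first. As a rough sketch, a helper like the hypothetical `find_nonfinite` below could be called after `loss.backward()` and before `optimizer.step()` to list every parameter whose value or gradient contains NaN or inf. The small `nn.Sequential` model is a stand-in, not the real network.

```python
import torch
import torch.nn as nn


def find_nonfinite(model):
    """Return names of parameters whose values or gradients contain NaN/inf."""
    bad = []
    for name, param in model.named_parameters():
        if not torch.isfinite(param).all():
            bad.append(name)
        if param.grad is not None and not torch.isfinite(param.grad).all():
            bad.append(name + ".grad")
    return bad


# Hypothetical two-layer model standing in for the real network.
model = nn.Sequential(nn.Linear(3, 2), nn.Linear(2, 1))
model(torch.randn(4, 3)).sum().backward()
clean = find_nonfinite(model)

# Simulate a corrupted gradient to show what the helper reports.
model[0].weight.grad[0, 0] = float("nan")
corrupted = find_nonfinite(model)
print(clean, corrupted)
```

Stopping the loop as soon as the returned list is non-empty would identify the first step and layer where things go wrong, before the optimizer spreads the NaN to the weights.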
How do you suggest I continue from here?