Could you please help me figure out why I am getting a NaN loss value, and how to debug and fix it?
P.S.: Why are my losses so large, and how can I fix them?
After running this cell of code:
network = Network()
network.cuda()

criterion = nn.MSELoss()
optimizer = optim.Adam(network.parameters(), lr=0.0001)

loss_min = np.inf
num_epochs = 10

start_time = time.time()
for epoch in range(1,num_epochs+1):

    loss_train = 0
    loss_test = 0
    running_loss = 0

    network.train()
    print('size of train loader is: ', len(train_loader))

    for step in range(1,len(train_loader)+1):

        batch = next(iter(train_loader))
        images, landmarks = batch['image'], batch['landmarks']
        images = images.permute(0,3,1,2)
        images = images.cuda()
        #RuntimeError: Given groups=1, weight of size [64, 3, 7, 7], expected input[64, 600, 800, 3] to have 3 channels, but got 600 channels instead

        landmarks = landmarks.view(landmarks.size(0),-1).cuda()

        ##images = torchvision.transforms.Normalize(images)
        ##landmarks = torchvision.transforms.Normalize(landmarks)

        predictions = network(images)

        # clear all the gradients before calculating them
        optimizer.zero_grad()

        # find the loss for the current step
        loss_train_step = criterion(predictions.float(), landmarks.float())
        ##loss_train_step = loss_train_step.to(torch.float32)

        # calculate the gradients
        loss_train_step.backward()

        # update the parameters
        optimizer.step()

        loss_train += loss_train_step.item()
        running_loss = loss_train/step

        print_overwrite(step, len(train_loader), running_loss, 'train')

    network.eval()
    with torch.no_grad():

        for step in range(1,len(test_loader)+1):

            batch = next(iter(train_loader))
            images, landmarks = batch['image'], batch['landmarks']
            images = images.permute(0,3,1,2)
            images = images.cuda()
            landmarks = landmarks.view(landmarks.size(0),-1).cuda()

            predictions = network(images)

            # find the loss for the current step
            loss_test_step = criterion(predictions, landmarks)

            loss_test += loss_test_step.item()
            running_loss = loss_test/step

            print_overwrite(step, len(test_loader), running_loss, 'Testing')

    loss_train /= len(train_loader)
    loss_test /= len(test_loader)

    print('\n--------------------------------------------------')
    print('Epoch: {} Train Loss: {:.4f} Test Loss: {:.4f}'.format(epoch, loss_train, loss_test))
    print('--------------------------------------------------')

    if loss_test < loss_min:
        loss_min = loss_test
        torch.save(network.state_dict(), '../moth_landmarks.pth')
        print("\nMinimum Test Loss of {:.4f} at epoch {}/{}".format(loss_min, epoch, num_epochs))
        print('Model Saved\n')

print('Training Complete')
print("Total Elapsed Time : {} s".format(time.time()-start_time))
I get the following NaN losses:
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 1 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 2 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 3 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 4 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 5 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 6 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 7 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 8 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 9 Train Loss: nan Test Loss: nan
--------------------------------------------------
size of train loader is: 90
Valid Steps: 10/10 Loss: nan
--------------------------------------------------
Epoch: 10 Train Loss: nan Test Loss: nan
--------------------------------------------------
Training Complete
Total Elapsed Time : 934.4894697666168 s
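One thing worth checking in the loops above: `next(iter(train_loader))` builds a fresh iterator on every step, so each step takes the first batch of a new pass instead of walking through the loader, and the evaluation loop also draws from `train_loader` rather than `test_loader`. The sketch below uses a plain Python list as a stand-in for a loader (an assumption for illustration only; a real `DataLoader` with `shuffle=True` would return a random first batch each time rather than the same one) and contrasts that pattern with iterating directly. This is probably not the cause of the NaN by itself, but it does affect what the reported losses mean.

```python
# Stand-in for a DataLoader: a list of "batches" (hypothetical data).
fake_loader = ["batch0", "batch1", "batch2"]

def batches_via_next_iter(loader):
    """Mimics the loop in the post: a new iterator is created at every step."""
    seen = []
    for _ in range(len(loader)):
        # A fresh iterator always starts from the beginning of the loader.
        seen.append(next(iter(loader)))
    return seen

def batches_via_enumerate(loader):
    """Iterates the loader once, visiting every batch exactly once."""
    seen = []
    for step, batch in enumerate(loader, start=1):
        seen.append(batch)
    return seen

print(batches_via_next_iter(fake_loader))
print(batches_via_enumerate(fake_loader))
```

In the training loop this would correspond to `for step, batch in enumerate(train_loader, 1):`, and to `enumerate(test_loader, 1)` in the evaluation loop.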
Here’s the network:
num_classes = 4 * 2 #4 coordinates X and Y flattened --> 4 of 2D keypoints or landmarks

class Network(nn.Module):
    def __init__(self,num_classes=8):
        super().__init__()
        self.model_name = 'resnet18'
        self.model = models.resnet18()
        self.model.fc = nn.Linear(self.model.fc.in_features, num_classes)

    def forward(self, x):
        x = x.float()
        out = self.model(x)
        return out
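To find out at which step the NaN first appears, one option is to check the loss before calling `backward()`. The sketch below is a debugging aid under stated assumptions, not a fix: `guarded_step` is a hypothetical helper, a small `nn.Linear` stands in for `Network()`, and the gradient clipping is an optional extra that is not part of the original training loop.

```python
import torch
import torch.nn as nn

def guarded_step(model, optimizer, criterion, inputs, targets, max_norm=1.0):
    """Runs one optimisation step. Returns False and skips the update
    when the loss is NaN or Inf, so the weights are not corrupted."""
    optimizer.zero_grad()
    loss = criterion(model(inputs), targets)
    if not torch.isfinite(loss):
        return False
    loss.backward()
    # Optional: limit the gradient norm to reduce the risk of exploding updates.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return True

torch.manual_seed(0)
model = nn.Linear(4, 8)  # hypothetical stand-in for Network()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.MSELoss()

x = torch.randn(2, 4)
good_targets = torch.zeros(2, 8)
bad_targets = torch.full((2, 8), float("nan"))  # simulates corrupted landmarks

ok_good = guarded_step(model, optimizer, criterion, x, good_targets)
ok_bad = guarded_step(model, optimizer, criterion, x, bad_targets)
print(ok_good, ok_bad)
```

If the very first step already returns `False`, the inputs or targets are the more likely culprit than the optimisation itself. `torch.autograd.set_detect_anomaly(True)` can additionally report which backward operation produced a NaN.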
Even if I comment out the part related to Normalize, I still get a NaN loss:
transformed_dataset = MothLandmarksDataset(csv_file='moth_gt.csv',
                                           root_dir='.',
                                           transform=transforms.Compose(
                                               [
                                                   Rescale(256),
                                                   RandomCrop(224),
                                                   ToTensor()#,
                                                   ##transforms.Normalize(mean = [ 0.485, 0.456, 0.406 ],
                                                   ##                     std = [ 0.229, 0.224, 0.225 ])
                                               ]
                                           )
                                           )
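Since the loss is NaN with or without Normalize, one possible cause is NaN or Inf values already present in the data, for example an empty cell in `moth_gt.csv` (an assumption; the CSV contents are not shown here). A minimal sketch of a scan over the dataset follows. `find_nonfinite_samples` is a hypothetical helper, and a list of dicts stands in for `transformed_dataset`, assuming each sample is a dict with `'image'` and `'landmarks'` entries as in the training loop.

```python
import torch

def find_nonfinite_samples(dataset, keys=("image", "landmarks")):
    """Returns a list of (index, key) pairs whose values contain NaN or Inf."""
    bad = []
    for idx in range(len(dataset)):
        sample = dataset[idx]
        for key in keys:
            # as_tensor accepts both tensors and NumPy arrays.
            values = torch.as_tensor(sample[key], dtype=torch.float32)
            if not torch.isfinite(values).all():
                bad.append((idx, key))
    return bad

# Hypothetical stand-in for transformed_dataset: a list of sample dicts.
fake_dataset = [
    {"image": torch.rand(3, 4, 4),
     "landmarks": torch.tensor([[1.0, 2.0], [3.0, 4.0]])},
    {"image": torch.rand(3, 4, 4),
     "landmarks": torch.tensor([[float("nan"), 2.0], [3.0, 4.0]])},
]

bad_samples = find_nonfinite_samples(fake_dataset)
print(bad_samples)
```

Running the same function on the real dataset would list the indices to inspect in the CSV. An empty result would point away from the data and towards the training step.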
This is the result after I commented out transforms.Normalize and changed the number of epochs to 1:
size of train loader is: 90
Valid Steps: 10/10 Loss: nan 8.5625
--------------------------------------------------
Epoch: 1 Train Loss: nan Test Loss: nan
--------------------------------------------------
Training Complete
Total Elapsed Time : 93.34211421012878 s
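On the P.S. about large losses: if the landmarks are stored in pixel coordinates (an assumption; the CSV contents are not shown here), MSE grows with the square of the coordinate range, so values in the thousands are expected early in training. The sketch below compares the loss on raw pixel coordinates with the loss after scaling the coordinates into [0, 1]. `normalize_landmarks` is a hypothetical helper, and the coordinates and the 224x224 size are made-up illustration values.

```python
import torch
import torch.nn as nn

def normalize_landmarks(landmarks, width, height):
    """Scales (x, y) pixel coordinates of shape (N, K, 2) into [0, 1]."""
    scale = torch.tensor([width, height], dtype=torch.float32)
    return landmarks.float() / scale

criterion = nn.MSELoss()

# Hypothetical pixel-space targets for 4 landmarks on a 224x224 crop.
target_px = torch.tensor(
    [[[50.0, 60.0], [100.0, 120.0], [150.0, 90.0], [200.0, 210.0]]]
)
# An untrained network tends to predict values near zero.
pred_px = torch.zeros_like(target_px)

loss_px = criterion(pred_px, target_px).item()
loss_unit = criterion(
    normalize_landmarks(pred_px, 224, 224),
    normalize_landmarks(target_px, 224, 224),
).item()
print(loss_px, loss_unit)
```

The two losses differ by a factor of 224 squared. With scaled targets, the predictions have to be multiplied by the image size again before plotting them on the image.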