Error loss: nan!

I fetch a single sample batch from the dataloader with the batch size set to 1. The image shape is
1 x 3 x 224 x 224 and the label shape is 1 x 7 x 7 x 5. When I calculate the loss for this one image, I get NaN. Why? I also tried training the network on whole batches, and the loss is still NaN. Thank you for reading.

face_data = FaceAnnoDataset(root_dir=path, img_dir ='image', anno_dir='label', 
                            txtfile='image.txt', transform=transforms.Compose([
                                transforms.Normalize([0.2341, 0.2388, 0.2622], [0.2210, 0.2150, 0.2543])]))
train_loader = DataLoader(face_data, batch_size=1, shuffle=False, pin_memory=True,
                         num_workers=2, collate_fn=collate_fn)

# get one sample of batch, shape: 1 x 3 x 224 x 224
train_sample = next(iter(train_loader))

image = train_sample[0]
anno = train_sample[1]

model = DetectionNet()
# move the model and the inputs to the GPU
model = model.cuda()
image = image.cuda()
anno = anno.cuda()

y_pred = model(image)
y_pred = y_pred.permute(0,2,3,1)
loss = loss_fn(y_pred, anno)

torch.Size([1, 3, 224, 224])
torch.Size([1, 7, 7, 5])
torch.Size([1, 7, 7, 5])
tensor(nan, device='cuda:0', grad_fn=<AddBackward0>)

def loss_fn(y_pred, y):
    loss = conf_regression_loss(y_pred, y) + bbox_regression_loss(y_pred, y)
    return loss

def conf_regression_loss(y_pred, y, lamda=0.5):
    """
    y_pred: output of forward propagation, shape: batch x grid x grid x 5
    y: ground truth, shape: batch x grid x grid x 5
    lamda: weight of loss_no_obj; since no-object cells dominate, their loss
    is scaled down so that it does not overpower the total loss
    return: confidence loss
    """
    # mask of active grid cells, i.e. cells that contain an object according
    # to the ground truth label
    mask = y[:,:,:,0] # shape: batch x grid x grid
    y_pred_c = y_pred[:,:,:,0]
    # if an object exists in the cell
    loss_obj = torch.sum((mask * y_pred_c - mask)**2) # mask == y_c, here the confidence is equivalent to the mask
    # if no object exists in the cell, we have to scale down
    # the loss, as the number of cells without an object is
    # much larger than the number of cells with one
    # get the no-object mask, where 1 indicates no object
    mask_no_obj = mask.clone()
    mask_no_obj[mask==0] = 1
    mask_no_obj[mask==1] = 0
    loss_no_obj = torch.sum((y_pred_c * mask_no_obj - mask)**2)
    loss_no_obj = loss_no_obj * lamda 
    loss = loss_obj + loss_no_obj
    return loss

def bbox_regression_loss(y_pred, y, lamda=5):
    mask = y[:,:,:,0]
    # loss of offset x, y
    loss_offset = torch.sum((mask * y_pred[:,:,:,1] - y[:,:,:,1])**2 +\
    (mask * y_pred[:,:,:,2] - y[:,:,:,2])**2)
    # loss of width and height
    loss_w_h = torch.sum((mask * torch.sqrt(y_pred[:,:,:,3]) - torch.sqrt(y[:,:,:,3]))**2 + \
    (mask * torch.sqrt(y_pred[:,:,:,4]) - torch.sqrt(y[:,:,:,4]))**2)
    loss = lamda * (loss_offset + loss_w_h)
    return loss

In your bbox_regression_loss you are calculating torch.sqrt of y_pred.
Did you make sure that no negative values are passed to it? torch.sqrt returns NaN for negative inputs.
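
To illustrate the point about torch.sqrt, here is a minimal sketch with made-up values (the tensor `pred` is hypothetical, not taken from the model above). It shows that a single negative entry is enough to turn a summed loss into NaN, and that clamping before the square root is one possible way to avoid it.

```python
import torch

# Hypothetical raw width/height predictions; one of them is negative.
pred = torch.tensor([0.25, -0.04, 1.0])

raw = torch.sqrt(pred)                    # NaN for the negative entry
summed = raw.sum()                        # the NaN propagates into the sum
clamped = torch.sqrt(pred.clamp(min=0))   # negative entry is treated as 0

print(raw)
print(summed)
print(clamped)
```

Clamping is only one option; constraining the network output itself (for example with a sigmoid on the last layer) would be another, depending on how the labels are scaled.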


Yes. That’s exactly where the problem is. Thank you!

Hi, I still have a problem. When I run the code below, I also get a NaN loss. When I changed the call to the commented-out version, it worked. Why didn’t the first version work? Thank you in advance.

# Initialize and load Dataset
face_data = FaceAnnoDataset(root_dir=path, img_dir ='image', anno_dir='label', 
                            txtfile='image.txt', transform=transforms.Compose([
                                transforms.Normalize([0.2341, 0.2388, 0.2622], [0.2210, 0.2150, 0.2543])]))

train_loader = DataLoader(face_data, batch_size=32, shuffle=False, pin_memory=True,
                         num_workers=2, collate_fn=collate_fn)

optimizer = optim.Adam(model.parameters(), lr=0.0001, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)

step_lr_scheduler = lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.85)
# This outputs a nan loss
train(model=model, train_loader=train_loader, criterion=loss_fn, optimizer=optimizer, scheduler=step_lr_scheduler, num_epochs=150)

# The following works
#train(model=DetectionNet(), train_loader=train_loader, criterion=loss_fn, optimizer=optimizer, scheduler=step_lr_scheduler, num_epochs=150)

If model is an instance of DetectionNet(), you might just have been lucky with the random number generator in the second approach and might not have run into negative values in the sqrt.
How reproducible is the finding? I.e., if you run both approaches with different seeds, does the first one always fail while the second one succeeds?
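
One way to run that check is to seed the random number generator right before the model is created, so that each initialization can be repeated. The sketch below uses nn.Linear as a hypothetical stand-in for DetectionNet, and `make_layer` is a helper invented for this example.

```python
import torch
import torch.nn as nn

def make_layer(seed):
    # Seed the global RNG right before construction so that the random
    # weight initialization is repeatable.
    torch.manual_seed(seed)
    return nn.Linear(4, 2)  # hypothetical stand-in for DetectionNet

a = make_layer(0)
b = make_layer(0)
c = make_layer(1)

print(torch.equal(a.weight, b.weight))  # same seed, same initial weights
print(torch.equal(a.weight, c.weight))  # different seed, different weights
```

Running both train() calls under several seeds would show whether the NaN depends on the initialization or on the way the call is set up.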

I used clamp(min=0) on the output of forward propagation. I calculated the loss of two batches manually, one at a time, and the loss works fine. I think the problem is the following. I first ran this, and it didn’t work:

train(model=model, train_loader=train_loader, criterion=loss_fn, optimizer=optimizer, scheduler=step_lr_scheduler, num_epochs=150)

Then I modified the code and ran the call below. Since I use a Jupyter notebook, the variables from the previous execution still exist, so there were two different model instances. The loss was calculated with the new instance, but the optimizer was still tied to the parameters of the first instance of the model.

train(model=DetectionNet(), train_loader=train_loader, criterion=loss_fn, optimizer=optimizer, scheduler=step_lr_scheduler, num_epochs=150)
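
The effect described above can be reproduced in a few lines. This is a sketch under assumptions: nn.Linear stands in for DetectionNet, SGD is used instead of Adam to keep it short, and the names `first` and `second` are invented for the example. The optimizer is built from the parameters of one instance while the loss is computed with another.

```python
import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)
first = nn.Linear(3, 1)    # stand-in for the original `model`
optimizer = optim.SGD(first.parameters(), lr=0.1)

second = nn.Linear(3, 1)   # fresh instance, like passing DetectionNet() to train()
before = second.weight.detach().clone()

x = torch.ones(2, 3)
optimizer.zero_grad()
loss = second(x).pow(2).sum()   # the loss comes from the second instance
loss.backward()                 # gradients land on the second instance
optimizer.step()                # the optimizer only knows the first instance

print(torch.equal(second.weight.detach(), before))  # weights did not move
print(first.weight.grad is None)                    # first instance got no gradient
```

Because the new instance is never updated, its loss can stay finite simply because no training takes place. Creating the optimizer from the parameters of the model that is actually passed to train() avoids this.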

I added 1e-8 inside the sqrt(), and it worked. My initial loss is around 500. Is that normal for a single-stage detector? After 90 epochs the loss is stuck at around 80. Could you give me any advice? I used the following optimizer.

optimizer = optim.Adam(model.parameters(), lr=2e-5, betas=(0.9, 0.999), eps=1e-08, weight_decay=0, amsgrad=False)
scheduler = lr_scheduler.StepLR(optimizer, step_size=80, gamma=0.9)
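
Regarding the 1e-8 added inside sqrt() above, a possible explanation for why clamp(min=0) alone may not be enough: the derivative of sqrt is unbounded at 0, and when autograd evaluates it under a zero mask the result is 0 / 0. The sketch below uses made-up one-element tensors, not the actual model output.

```python
import torch

mask = torch.tensor([0.0])  # a cell without an object

# Plain sqrt at exactly zero: the backward pass divides 0 by 0.
x = torch.zeros(1, requires_grad=True)
(mask * torch.sqrt(x)).sum().backward()
grad_plain = x.grad.clone()

# Same computation with a small epsilon inside the sqrt.
z = torch.zeros(1, requires_grad=True)
(mask * torch.sqrt(z + 1e-8)).sum().backward()
grad_eps = z.grad.clone()

print(grad_plain)  # NaN gradient, even though the forward value is finite
print(grad_eps)    # finite gradient
```

So the forward loss can look fine after clamping while the first backward pass already writes NaN into the weights, which would then show up as a NaN loss on the next batch.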

Hi, could you give me any advice on training? Thank you. The training loss doesn’t decrease any further. Is it normal to have such a high loss and to be stuck there? I tried different learning rates (1e-3, 1e-4, 1e-5) and used a scheduler that multiplies the learning rate by 0.1 every 40 epochs. If the learning rate were too large, the loss should have changed after 40 and 80 epochs, but it doesn’t seem to.

---Epochs: 0/300---Training loss:378.9626 time per epoch: 22.2s
tensor(463.3724, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:463.3724
---Epochs: 1/300---Training loss:296.5503 time per epoch: 22.0s
tensor(376.3032, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:376.3032
---Epochs: 2/300---Training loss:251.9911 time per epoch: 22.0s
tensor(339.0182, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:339.0182
---Epochs: 3/300---Training loss:213.2080 time per epoch: 22.0s
tensor(321.4742, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:321.4742
---Epochs: 4/300---Training loss:208.8714 time per epoch: 22.1s
tensor(317.3396, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:317.3396
---Epochs: 5/300---Training loss:201.6067 time per epoch: 22.2s
tensor(313.2032, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:313.2032
---Epochs: 6/300---Training loss:197.0980 time per epoch: 22.1s
tensor(310.2551, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:310.2551
---Epochs: 7/300---Training loss:198.4707 time per epoch: 22.0s
tensor(307.9322, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:307.9322
---Epochs: 8/300---Training loss:192.7238 time per epoch: 22.0s
tensor(307.5226, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:307.5226
---Epochs: 9/300---Training loss:193.3006 time per epoch: 22.2s
tensor(307.2828, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:307.2828
---Epochs: 10/300---Training loss:196.9237 time per epoch: 22.0s
tensor(307.1403, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:307.1403
---Epochs: 11/300---Training loss:209.6514 time per epoch: 22.1s
tensor(307.0083, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:307.0083
---Epochs: 12/300---Training loss:181.3789 time per epoch: 22.1s
tensor(306.9789, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:306.9789
---Epochs: 13/300---Training loss:189.3164 time per epoch: 22.1s
tensor(306.9626, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:306.9626
---Epochs: 14/300---Training loss:207.4668 time per epoch: 22.1s
tensor(306.9458, device='cuda:0', grad_fn=<AddBackward0>)
---Validation loss:306.9458
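
To make the step schedules mentioned in this thread concrete, here is a small sketch of the learning rate a step decay produces at a given epoch. It assumes the scheduler is stepped once per epoch; `step_lr` is a helper written for this example, not a library function.

```python
import math

def step_lr(base_lr, epoch, step_size, gamma):
    # Closed form of a step decay: multiply by gamma once every step_size epochs.
    return base_lr * gamma ** (epoch // step_size)

# Schedule that multiplies the learning rate by 0.1 every 40 epochs.
for epoch in (0, 39, 40, 79, 80):
    print(epoch, step_lr(1e-3, epoch, step_size=40, gamma=0.1))

# Earlier configuration in this thread: lr=2e-5, step_size=80, gamma=0.9.
print(90, step_lr(2e-5, 90, step_size=80, gamma=0.9))
```

With step_size=80 and gamma=0.9 the learning rate after 90 epochs is still about 1.8e-5, so that schedule has barely changed anything by the point where the loss stalls.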