Loss average is staying at nan always during model training

amitkayal · October 10, 2020, 6:13am

Hello,

I have written my own IoU loss function, and the model always reports the training and validation loss as nan. When I switch to the cross-entropy loss function, training works fine, so maybe something is wrong with my loss function? I have also printed the loss value and the average loss value inside my loss function, and they show very small values. Can you please help me with this? Thanks

Loss Function Code

import torch.nn as nn


class IntersectionOverUnion(nn.Module):
    """
    Implementation of the Soft-Dice Loss function.

    Arguments:
        num_classes (int): number of classes.
        eps (float): value of the floating point epsilon.
    """

    def __init__(self, num_classes, eps=1e-5):
        super().__init__()
        # init class fields
        self.num_classes = num_classes
        self.eps = eps

    # define the forward pass
    def forward(self, preds, targets):  # pylint: disable=unused-argument
        """
        Compute Soft-Dice Loss.

        Arguments:
            preds (torch.FloatTensor):
                tensor of predicted labels. The shape of the tensor is (B, num_classes, H, W).
            targets (torch.LongTensor):
                tensor of ground-truth labels. The shape of the tensor is (B, 1, H, W).

        Returns:
            mean_loss (float32): mean loss by class value.
        """
        loss = 0
        # iterate over all classes
        for cls in range(self.num_classes):
            # get ground truth for the current class
            target = (targets == cls).float()
            # get prediction for the current class
            pred = preds[:, cls]
            # calculate intersection
            intersection = (pred * target).sum()  # Will be zero if Truth=0 or Prediction=0
            # calculate union for the current class
            union = (pred + target).sum()  # Will be zero if both are 0
            # compute dice coefficient
            # iou = (2 * intersection + self.eps) / (pred.sum() + target.sum() + self.eps)
            iou = (intersection + self.eps) / (union + self.eps)  # We smooth our division to avoid 0/0
            print("IOU Value:", iou)
            # compute negative logarithm from the obtained
            loss = loss - iou.log()
            print("loss Value:", iou)
        # get mean loss by class value
        loss = loss / self.num_classes
        print("loss Avg Value:", iou)
        return loss

Model Training Result:

> epoch: 2, test_miou: 0.090242, train_loss: nan, test_loss: nan: 25%
> 3/12 [1:22:27<3:35:14, 1434.99s/it]
> [0/12][Train][261] Loss_avg: nan, Loss: nan, LR: 1e-05: 100%
> 262/262 [1:11:44<00:00, 16.43s/it]
> Streaming output truncated to the last 5000 lines.
> IOU Value: tensor(-0.0143487109, device='cuda:0', grad_fn=<DivBackward0>)
> loss Value: tensor(-0.0143487109, device='cuda:0', grad_fn=<DivBackward0>)
> IOU Value: tensor(0.0014381389, device='cuda:0', grad_fn=<DivBackward0>)
> loss Value: tensor(0.0014381389, device='cuda:0', grad_fn=<DivBackward0>)
> IOU Value: tensor(-0.0324460752, device='cuda:0', grad_fn=<DivBackward0>)
> loss Value: tensor(-0.0324460752, device='cuda:0', grad_fn=<DivBackward0>)
> IOU Value: tensor(0.0008592299, device='cuda:0', grad_fn=<DivBackward0>)
> loss Value: tensor(0.0008592299, device='cuda:0', grad_fn=<DivBackward0>)
> IOU Value: tensor(-3.3268113264e-10, device='cuda:0', grad_fn=<DivBackward0>)
> loss Value: tensor(-3.3268113264e-10, device='cuda:0', grad_fn=<DivBackward0>)
> IOU Value: tensor(-1.4262332426e-09, device='cuda:0', grad_fn=<DivBackward0>)
> loss Value: tensor(-1.4262332426e-09, device='cuda:0', grad_fn=<DivBackward0>)

ptrblck · October 12, 2020, 6:23am

Could you post the code snippet, which prints:

Loss_avg: nan, Loss: nan, LR: 1e-05: 100%

I cannot find anything obviously wrong in your current code snippet and the output also seems to return valid values.
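
One thing that might still be worth checking: several of the printed IOU values above are negative (for example -0.0143487109). Since `target` only contains 0s and 1s, a negative ratio can only appear if `preds` contains negative entries, and `iou.log()` of a negative number is nan. The snippet below is a minimal sketch of one possible way around this, assuming `preds` are raw, unnormalized scores; it applies a softmax first and computes the union as `pred.sum() + target.sum() - intersection`. The function name `soft_iou_loss` and the random test tensors are hypothetical and only serve the illustration.

```python
import torch
import torch.nn.functional as F


def soft_iou_loss(logits, targets, num_classes, eps=1e-5):
    """Hypothetical helper: mean over classes of -log(soft IoU).

    logits:  (B, num_classes, H, W) raw network outputs (an assumption).
    targets: (B, 1, H, W) integer class labels.
    """
    # Softmax maps every score into [0, 1], so the sums below cannot be negative.
    probs = F.softmax(logits, dim=1)
    # Drop the channel axis so targets and per-class predictions are both (B, H, W).
    labels = targets.squeeze(1)
    loss = 0.0
    for cls in range(num_classes):
        target = (labels == cls).float()
        pred = probs[:, cls]
        intersection = (pred * target).sum()
        # Soft union: sum of both areas minus the overlap.
        union = pred.sum() + target.sum() - intersection
        iou = (intersection + eps) / (union + eps)
        loss = loss - iou.log()
    return loss / num_classes


# The log of a negative value is nan, which then propagates into the averaged loss.
nan_demo = torch.log(torch.tensor(-0.0143487109))

# Random inputs, used only to exercise the function.
torch.manual_seed(0)
logits = torch.randn(2, 3, 4, 4)
targets = torch.randint(0, 3, (2, 1, 4, 4))
loss = soft_iou_loss(logits, targets, num_classes=3)
print(nan_demo, loss)
```

With probabilities in [0, 1] the ratio stays in (0, 1], so the negative log is finite and non-negative. If your model already outputs probabilities, the softmax is not needed and the cause is probably elsewhere.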