Verify the implementation of a loss function

I am trying to implement the Generalized Dice Loss function myself. When I tested the output of my implementation against MONAI's, I got the same result. However, the training losses of the two functions differed when I used them in a training pipeline. I do not know the reason for this difference, and I am also not sure whether the way I compare the two loss functions is correct. Below is the code I used to test the implementation:

import torch
from loss_factory import LossFunctionType, LossFunctionFactory
from monai import losses


# custom implementation and the MONAI reference, both computing Generalized Dice Loss
loss_fn = LossFunctionFactory.construct_loss_function(LossFunctionType.GENERALIZED_DICE)
monai_loss_fn = losses.dice.GeneralizedDiceLoss(to_onehot_y=True, softmax=True)
num_classes = 2

# random logits and integer class labels for a small 3 x 4 "image"
inputs = torch.randn((1, num_classes, 3, 4), requires_grad=True)
targets = torch.empty((1, 1, 3, 4), dtype=torch.long).random_(num_classes)
loss = loss_fn(inputs, targets)
print(loss)

# detached copy so the two backward passes do not interfere
inputs_ = inputs.detach().clone().requires_grad_()
monai_loss = monai_loss_fn(inputs_, targets)
print(monai_loss)

Hi Richard!

If I understand you correctly, you are saying that the two computed loss values
are equal (up to potential round-off error), but that training with the two loss
functions gives different results.

This suggests that you should also test the gradients of the two losses:

loss.backward()          # computes the gradient of your implementation
print(inputs.grad)       # print your gradient

monai_loss.backward()    # computes the gradient of the monai implementation
print(inputs_.grad)      # print the monai gradient

# check equality to some appropriate tolerance
print(torch.allclose(inputs.grad, inputs_.grad, atol=1.e-4))

Best.

K. Frank

I have followed your suggestion, and the result still shows that these two functions produce the same losses and gradients:

tensor(0.5856, grad_fn=<MeanBackward0>)
tensor(0.5856, grad_fn=<MeanBackward0>)
tensor([[[[ 0.0097,  0.0145,  0.0016,  0.0126],
          [ 0.0150, -0.0162,  0.0149, -0.0120],
          [-0.0283,  0.0107,  0.0146, -0.0294]],

         [[-0.0097, -0.0145, -0.0016, -0.0126],
          [-0.0150,  0.0162, -0.0149,  0.0120],
          [ 0.0283, -0.0107, -0.0146,  0.0294]]]])
tensor([[[[ 0.0097,  0.0144,  0.0016,  0.0126],
          [ 0.0150, -0.0162,  0.0149, -0.0120],
          [-0.0283,  0.0107,  0.0145, -0.0294]],

         [[-0.0097, -0.0144, -0.0016, -0.0126],
          [-0.0150,  0.0162, -0.0149,  0.0120],
          [ 0.0283, -0.0107, -0.0145,  0.0294]]]])
True

Hi Richard!

Okay, that looks good.

Here are three more debugging suggestions:

First, make sure any randomness in your training pipeline – shuffling in your
dataloader, dropouts, etc. – is the same in the two training runs you are
comparing. If it's not, you could easily get large differences in your results
that have nothing to do with the two versions of your loss function.
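
Something along these lines, called at the very start of each of the two runs,
should do it. (This is just a sketch. seed_everything is only a name I made up,
and which of these knobs actually matter depends on your pipeline.)

import random
import numpy as np
import torch

def seed_everything(seed: int = 0) -> None:
    # pin down the usual sources of randomness so that both runs see
    # identical shuffling, weight initialization, dropout masks, etc.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # optional: force deterministic cuDNN kernels (can slow training down)
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False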

Second, when you train, small differences due to expected round-off error can
accumulate over many optimization cycles. The two training runs can wander
off in two (equally-good) directions, leading to results that differ by a lot.

Try optimizing with plain-vanilla SGD with no momentum and a small learning
rate. Do your results agree (within a smallish multiple of round-off error) after
a single optimization step? Do they continue to agree reasonably well after a
handful of optimization steps? If your two training runs diverge from one another
only rather slowly, things are probably fine, even if the two runs end up with
distinctly different results.
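
As a concrete (hypothetical) version of this check, you could take a single
plain-SGD step with each loss, starting from identical weights, and compare
the resulting parameters. Here I use a tiny Conv2d as a stand-in for your
model and reuse the imports from your test script; substitute your own
network and data:

import copy
import torch
from loss_factory import LossFunctionType, LossFunctionFactory
from monai import losses

loss_fn = LossFunctionFactory.construct_loss_function(LossFunctionType.GENERALIZED_DICE)
monai_loss_fn = losses.dice.GeneralizedDiceLoss(to_onehot_y=True, softmax=True)
num_classes = 2

# tiny stand-in model; model_b starts from exactly the same weights as model_a
model_a = torch.nn.Conv2d(1, num_classes, kernel_size=3, padding=1)
model_b = copy.deepcopy(model_a)

# plain-vanilla SGD, no momentum, small learning rate
opt_a = torch.optim.SGD(model_a.parameters(), lr=1.0e-4, momentum=0.0)
opt_b = torch.optim.SGD(model_b.parameters(), lr=1.0e-4, momentum=0.0)

images = torch.randn(1, 1, 3, 4)
targets = torch.empty((1, 1, 3, 4), dtype=torch.long).random_(num_classes)

loss_fn(model_a(images), targets).backward()
opt_a.step()
monai_loss_fn(model_b(images), targets).backward()
opt_b.step()

# after one step the two parameter sets should agree to within a smallish
# multiple of round-off error
for p_a, p_b in zip(model_a.parameters(), model_b.parameters()):
    print(torch.allclose(p_a, p_b, atol=1.0e-6))

You can then repeat the step in a short loop and watch how quickly (or slowly)
the two parameter sets drift apart.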

Third, it could be that the two versions of your loss function agree for the
majority of inputs, but there are some edge cases (for example, maybe the
two versions use a different value of epsilon to protect against division by
very small values or zero in the Dice-coefficient computation) where they
give significantly different results. You could try running your consistency
check (including the gradient) on a large number of random inputs. You might
also consider salting your random inputs with possible edge-case examples.
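
For example (again just a sketch – check_once is a name I made up, and the
single-class target is only one plausible edge case to salt in):

import torch
from loss_factory import LossFunctionType, LossFunctionFactory
from monai import losses

loss_fn = LossFunctionFactory.construct_loss_function(LossFunctionType.GENERALIZED_DICE)
monai_loss_fn = losses.dice.GeneralizedDiceLoss(to_onehot_y=True, softmax=True)
num_classes = 2

def check_once(inputs, targets, atol=1.0e-4):
    # compare both the loss values and the gradients for one input
    inputs_a = inputs.clone().requires_grad_()
    inputs_b = inputs.clone().requires_grad_()
    loss_a = loss_fn(inputs_a, targets)
    loss_b = monai_loss_fn(inputs_b, targets)
    loss_a.backward()
    loss_b.backward()
    return (torch.allclose(loss_a, loss_b, atol=atol) and
            torch.allclose(inputs_a.grad, inputs_b.grad, atol=atol))

# many random inputs ...
for i in range(1000):
    inputs = torch.randn(1, num_classes, 3, 4)
    targets = torch.empty((1, 1, 3, 4), dtype=torch.long).random_(num_classes)
    if not check_once(inputs, targets):
        print('mismatch on random trial', i)

# ... salted with a possible edge case: a target containing only one class,
# where differing epsilon values are most likely to show up
targets_one_class = torch.zeros((1, 1, 3, 4), dtype=torch.long)
if not check_once(torch.randn(1, num_classes, 3, 4), targets_one_class):
    print('mismatch on single-class edge case')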

Good luck!

K. Frank