Dice-Score on imbalanced Validation Set


I have a dataset with positive and negative samples for a segmentation task, where:
1. “positive” means the image contains at least one object of interest
2. “negative” means the image contains no object of interest.

For every positive sample, there are roughly 3 negative samples in the dataset.

Now when I evaluate my model on my validation set, the dice score is very high at the beginning of training because the model just predicts 0 everywhere. After some iterations, the dice score starts to increase again, but it fails to reach the high scores from the start of training. This makes automatically saving the best-performing model very hard, if not impossible.
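To illustrate with a minimal sketch (assuming a soft dice with a smoothing term, one common convention; your exact formulation may differ): on a negative sample, an all-zero prediction scores ~1, so with a 3:1 negative-to-positive ratio the all-zero model averages around 0.75:

```python
import torch

def dice_score(pred, gt, eps=1e-7):
    # soft dice with a smoothing term: an empty prediction on an
    # empty ground truth scores ~1 instead of 0/0
    inter = (pred * gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

gt_pos = torch.ones(8, 8)    # toy positive sample: object everywhere
gt_neg = torch.zeros(8, 8)   # negative sample: no object
pred = torch.zeros(8, 8)     # model predicts 0 everywhere

scores = [dice_score(pred, g) for g in (gt_pos, gt_neg, gt_neg, gt_neg)]
mean = sum(scores) / len(scores)  # ~0.75, dominated by the negatives
```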

Any suggestions on how to balance the dice score so that it doesn’t favor negative examples during validation?


In my opinion, you could try putting a weight on your dice score / loss function that is proportional to the class ratio in your dataset. If you have roughly 3 negative samples per positive one, you could reflect that by giving the positive samples a weight of 3.
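A minimal sketch of that idea, assuming per-sample dice scores are already computed and using a hypothetical `weighted_mean_dice` helper (`pos_weight=3` mirrors the 1:3 ratio):

```python
import torch

def weighted_mean_dice(scores, is_positive, pos_weight=3.0):
    # hypothetical helper: up-weight positive samples to offset the 1:3 ratio
    w = torch.where(is_positive,
                    torch.full_like(scores, pos_weight),
                    torch.ones_like(scores))
    return (w * scores).sum() / w.sum()

# 1 positive (dice 0.0 under an all-zero model) and 3 negatives (dice 1.0 each)
scores = torch.tensor([0.0, 1.0, 1.0, 1.0])
is_pos = torch.tensor([True, False, False, False])
result = weighted_mean_dice(scores, is_pos)  # 0.5 for the all-zero model
```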

Sorry for the very late answer, but I am currently revisiting this issue.

Although your suggestion is possible, it would still lead to a dice score of at least 0.5 when the output of the network is constantly 0. What if the actual best performance the model can reach is only 0.4?

Meanwhile, I have searched the net a bit and am a little confused that this issue does not come up more often. Maybe someone else can step in?

Okay, I guess the solution is to aggregate all predictions and calculate the dice score over all results simultaneously.

I.e. instead of:

result = (dice_score(pred1, gt1) + dice_score(pred2, gt2)) / 2

do:

result = dice_score(torch.cat([pred1, pred2, ...], dim=1), torch.cat([gt1, gt2, ...], dim=1))

This could of course require huge amounts of RAM and compute, so it is better to accumulate the intersection and the union across batches and calculate the dice score once at the end.
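A sketch of that running aggregation, assuming a soft dice with smoothing and a hypothetical list of (prediction, ground-truth) batches; the running totals give the same result as concatenating everything first:

```python
import torch

def dice_score(pred, gt, eps=1e-7):
    # soft dice with a smoothing term
    inter = (pred * gt).sum()
    return (2 * inter + eps) / (pred.sum() + gt.sum() + eps)

torch.manual_seed(0)
# hypothetical mini validation set: list of (prediction, ground-truth) pairs
batches = [(torch.rand(1, 8, 8), torch.randint(0, 2, (1, 8, 8)).float())
           for _ in range(4)]

# accumulate intersection and denominator instead of storing every prediction
total_inter = sum((p * g).sum() for p, g in batches)
total_denom = sum(p.sum() + g.sum() for p, g in batches)
dice_running = (2 * total_inter + 1e-7) / (total_denom + 1e-7)

# same value as dice over the concatenated tensors, without keeping them all
dice_cat = dice_score(torch.cat([p for p, _ in batches]),
                      torch.cat([g for _, g in batches]))
```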