Calibrating probability output and threshold

Hello all,

I am working on a segmentation problem with significant class imbalance and came across this interesting paper in which the authors show that searching for an optimal threshold yields better results than the standard 0.5 threshold.

https://aapm.onlinelibrary.wiley.com/doi/full/10.1002/acm2.13331

Curious if anyone has experience with optimal post-processing of a probability mask?

So two things:

  • In my experience, the first thing to do is to balance the pixel classes better. For example, when training the U-Net for nodule detection in our book (a very imbalanced problem), we take care that enough slices with nodules are fed into the U-Net. A brief discussion is in section 13.5.5 (Designing our training and validation data). We didn’t write an entire paper about it, but in our situation it went from “doesn’t learn” (even with weighted dice) to “learns”.
  • If you have some holdout data, it is very reasonable to adjust the threshold (a minimal sketch of such a sweep is below this list), but if you trained your model well, it probably matters less, as U-Net class probabilities tend to exhibit, to some degree, the same overconfidence known from regular classification (i.e. they are either almost zero or almost one most of the time).
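
To make the second point concrete, here is a minimal sketch of tuning the binarization threshold on holdout data. The tensor names (`probs`, `targets`) and the candidate grid are just illustrative, not from any particular codebase:

```python
import torch

def dice_score(pred_mask, target_mask, eps=1e-6):
    # Plain (hard) Dice between two binary masks.
    intersection = (pred_mask & target_mask).sum().float()
    return (2 * intersection + eps) / (pred_mask.sum() + target_mask.sum() + eps)

def best_threshold(probs, targets, candidates=None):
    # Sweep candidate thresholds on the holdout set and keep the one
    # that gives the highest Dice.
    if candidates is None:
        candidates = torch.linspace(0.05, 0.95, steps=19)
    best_t, best_d = 0.5, -1.0
    for t in candidates:
        d = dice_score(probs > t, targets.bool())
        if d > best_d:
            best_t, best_d = float(t), float(d)
    return best_t, best_d
```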

Best regards

Thomas

Hi @tom thanks for your comment!

So the model I am working on is training well, with a soft Dice loss of ~0.10 (an approximate Dice of 0.90). I am formulating it as a 3D problem, so there isn’t really a notion of oversampling individual images (or “slices”), but the data is similar to pulmonary nodules in some sense and only represents a small fraction of the entire volume.
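
For reference, by soft Dice loss I mean roughly the usual 1 − soft-Dice formulation; a minimal sketch (my exact reduction over the batch may differ, and the variable names are just illustrative):

```python
import torch

def soft_dice_loss(probs, targets, eps=1e-6):
    # probs: sigmoid outputs of the network, targets: binary mask, same shape.
    probs = probs.flatten()
    targets = targets.flatten().float()
    intersection = (probs * targets).sum()
    dice = (2 * intersection + eps) / (probs.sum() + targets.sum() + eps)
    return 1 - dice  # a loss of ~0.10 corresponds to a soft Dice of ~0.90
```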

What’s interesting to me is that most people apply a single threshold to the probability map, but it seems like we could perhaps do significantly better. To some extent, it’s usually apparent where the correct segmentation is from the “shape” and appearance of features in the probability map, and a naive threshold can miss this information.

So you cannot feed cropped 3D data? This can, of course, happen, as U-Nets have a minimum input size. But then, maybe your data is less imbalanced than I thought.

In addition to thresholding the probability map, smoothing the shape with morphological operations (dilation/erosion) and discarding detection sites below a size threshold are common postprocessing steps, too. We discuss some of that in chapter 14 of the book.
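
A minimal sketch of that kind of cleanup with `scipy.ndimage` (not exactly what we do in the book; the threshold and the minimum component size are placeholders you would tune on holdout data):

```python
import numpy as np
import scipy.ndimage as ndi

def postprocess(prob_volume, threshold=0.5, min_voxels=50):
    mask = prob_volume > threshold
    # Morphological closing (dilation followed by erosion) to smooth the shape.
    mask = ndi.binary_closing(mask)
    # Label connected components and drop those smaller than min_voxels.
    labels, n = ndi.label(mask)
    if n == 0:
        return mask
    sizes = ndi.sum(mask, labels, index=range(1, n + 1))
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_voxels]
    return np.isin(labels, keep)
```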

One of the crucial things in practice is to have a rigorous evaluation pipeline for experiments early on. I must admit that the idea expressed in the paper of comparing different tasks based on the respective Dice scores they achieve makes me a bit uncomfortable. One may come to the conclusion that the objective function does not work well on some tasks, but there is also a trap in mixing exploration and evaluation too much. Or maybe I’m just grumpy today. :slight_smile:

Best regards

Thomas