Thresholding the prediction image to binary before sending it to the loss function

Hi!

There is a question I have been wondering about; I have tried to search for it but can't find anything. My question is:

Is it wrong to threshold the prediction image (output from a segmentation network) before sending it to the loss function? By that I mean:

prediction_image = model(image)
prediction_image[prediction_image >= threshold] = 1.0
prediction_image[prediction_image < threshold] = 0.0

loss = loss_function(prediction_image, ground_truth)

The ground truth is already a mask consisting of 0s and 1s. Meaning, if we have a modified Dice score (Tversky), which is

def tversky_loss(inputs, targets, alpha, beta, smooth):
    # flatten the prediction and ground-truth tensors
    inputs = inputs.view(-1)
    targets = targets.view(-1)

    # True Positives, False Positives & False Negatives
    TP = (inputs * targets).sum()
    FP = ((1 - targets) * inputs).sum()
    FN = (targets * (1 - inputs)).sum()

    Tversky = (TP + smooth) / (TP + alpha * FP + beta * FN + smooth)

    return 1 - Tversky

wouldn't that be more correct?

Naively, I would worry that creating this kind of kink in your output would zero out your gradients unless your output happens to be right around the threshold. This would make learning slower (and perhaps jumpier?). However, perhaps you've already tried it and found that it works, in which case you could post a plot of your validation loss vs. the baseline.

Thanks for the reply, Andrei, really appreciated it! I have not done it yet. I have done it for metrics, though. Before I calculate recall and Dice score, I set

prediction_image[prediction_image >= threshold] = 1.0
prediction_image[prediction_image < threshold] = 0.0

And then calculate TP, FN, and FP. Do you recommend that, or should I just use the raw prediction image when calculating the different metrics? (By this I do not mean the loss function, only evaluation metrics.)

Suppose we call model A the one trained without thresholding, and model B your suggested new model, trained by first thresholding and then applying the loss function. The two models have the same architecture, but different parameters (due to the different training approaches used).

I think when looking at metrics, the important thing for this particular question is to do the same thing for both model A and model B, so we are comparing apples to apples. Depending on your problem, you may very likely be “forced” to threshold by the type of answer your model is required to provide anyway (e.g. you may have to allocate every pixel to either type X or type Y), so I don't see the issue with doing it for evaluation.
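For concreteness, here is a minimal sketch of what thresholding only for evaluation could look like (the function and variable names are just illustrative, not taken from your code):

import torch

def dice_and_recall(pred_probs, target, threshold=0.5, eps=1e-7):
    # Hard-threshold only for evaluation; the loss still sees the raw probabilities.
    pred = (pred_probs >= threshold).float()
    target = target.float()

    TP = (pred * target).sum()
    FP = (pred * (1 - target)).sum()
    FN = ((1 - pred) * target).sum()

    dice = (2 * TP + eps) / (2 * TP + FP + FN + eps)
    recall = (TP + eps) / (TP + FN + eps)
    return dice.item(), recall.item()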

EDIT: The “kinked” prediction may actually not be differentiable, in which case you should expect to see no training at all.

EDIT 2: per the snippet below, indeed I think this type of thresholding would zero out the gradients and prevent any training.

Normally you expect this:

import torch

vec1 = torch.rand(10, requires_grad=True)
vec2 = torch.rand(10, requires_grad=True)
s = vec1 + vec2
print(s)
s.sum().backward()
print(vec1.grad)
print(vec2.grad)

Output:
tensor([0.6766, 1.4995, 0.7432, 1.7373, 1.1667, 0.8818, 1.2005, 0.6173, 0.8586,
        0.6945], grad_fn=<AddBackward0>)
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

With thresholding:

vec1 = torch.rand(10, requires_grad=True)
vec2 = torch.rand(10, requires_grad=True)
s = vec1 + vec2
s[s > 1.0] = 1.0  # thresholding
s[s < 1.0] = 0.0  # thresholding
print(s)
s.sum().backward()
print(vec1.grad)
print(vec2.grad)

Output:
tensor([1., 1., 1., 1., 1., 0., 0., 1., 0., 0.], grad_fn=<IndexPutBackward>)
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Note that the gradients (the last two tensors in the output) are zero everywhere, which means the network can't train.

However, I think even if you came up with some implementation that's not quite so kinked but still accomplishes thresholding to some degree (raising the difference from the threshold to some power, for example), you would still find that it pushes the gradients closer to zero and slows down training.
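To illustrate, here is a sketch of one possible "softer" thresholding, using a steep sigmoid around the threshold (the constants are made up purely for illustration):

import torch

threshold = 0.5
sharpness = 50.0  # larger = closer to a hard step

probs = torch.linspace(0.05, 0.95, steps=10, requires_grad=True)
soft = torch.sigmoid(sharpness * (probs - threshold))  # ~0 below the threshold, ~1 above
soft.sum().backward()

print(soft)        # values squashed toward 0 or 1
print(probs.grad)  # essentially zero except for entries close to the threshold

So it stays differentiable, but most of the gradients are still vanishingly small.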

If at all possible, would you be able to try how a straight-through estimator works in your case?

I do lesion segmentation for my thesis, so forcing the prediction image to either background or foreground is good in this case?

Background : 0
Foreground/lesion : 1

For evaluation, yes it should be fine.

Please do check out InnovArul’s reply and link above as well, where he suggests you try something called a “straight-through estimator” before calculating the loss. In thinking about it more, direct thresholding (like what we were discussing) is guaranteed to zero out the gradients and therefore will not work (though you should confirm that yourself); however, the estimator he suggests at least has a chance (though I’m still trying to get my head around it, based on the debate in the linked thread).

Straight-through estimator, taken from the other thread:

thresholded_inputs = torch.where(inputs < threshold, 0, 1)  # hard 0/1 mask from the raw predictions
inputs = (inputs + thresholded_inputs) - inputs.detach()    # forward value is the mask, gradient flows through inputs
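In case it is useful, here is my reading of how that trick behaves, as a small self-contained sketch (the same idea as the two lines above, just wrapped up so the gradients can be inspected):

import torch

def straight_through_threshold(probs, threshold=0.5):
    # Forward pass returns the hard 0/1 mask; backward pass acts like the identity,
    # so gradients flow through `probs` as if no thresholding had happened.
    hard = torch.where(probs < threshold, torch.zeros_like(probs), torch.ones_like(probs))
    return (probs + hard) - probs.detach()

probs = torch.rand(10, requires_grad=True)
out = straight_through_threshold(probs)
out.sum().backward()
print(out)         # hard 0/1 values (with a grad_fn attached)
print(probs.grad)  # all ones here, i.e. not zeroed out by the thresholding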

Hi innovarul! Thank you for the reply☺️

The thing is, I don't have a problem with my model learning; I just want to help the model and optimizer a bit and see if the validation Dice score increases or not. Do you think this technique will help?

I am not so sure if it will improve. You can give it a try and see how it goes.

When the thresholding operation (at 0.5) is applied, all probabilities in [0.5, 1] are mapped to 1. Thus it may easily satisfy the loss function even though the actual probabilities might be lower (e.g. < 0.7). There is a chance that it might degrade the performance too. You have to train and check how it works.
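A toy example of that effect (numbers made up purely for illustration), using a plain soft Dice on four foreground pixels:

import torch

def soft_dice(pred, target, smooth=1.0):
    # Dice computed directly on whatever predictions are passed in.
    intersection = (pred * target).sum()
    return (2 * intersection + smooth) / (pred.sum() + target.sum() + smooth)

target = torch.ones(4)                          # all four pixels are foreground
probs = torch.tensor([0.55, 0.60, 0.65, 0.70])  # fairly unconfident predictions

print(soft_dice(probs, target))                   # ~0.80: the loss still has something to push on
print(soft_dice((probs >= 0.5).float(), target))  # 1.00: thresholding already "satisfies" the loss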

Hi Andrei (and Einrone and Arul)!

For completeness, let me link to some further comments on this issue
that I posted in the aforementioned thread:

Best.

K. Frank