Suppose we call model A the one trained without thresholding, and model B your suggested new model, trained by first thresholding and then applying the loss function. The two models have the same architecture, but different parameters (due to the different training approaches used).
I think when looking at metrics, the important thing for this particular question is to do the same thing for both model A and model B, so we are comparing apples to apples. Depending on your problem, you may well be “forced” to threshold anyway by the type of answer your model is required to provide (e.g. you may have to allocate every pixel to either type X or type Y), so I don’t see an issue with thresholding for evaluation.
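As a sketch of what I mean (the 0.5 cutoff, the accuracy metric, and all the numbers here are placeholder choices, not anything from your setup): apply the identical thresholding step to both models' raw outputs before computing the metric.

```python
import torch

def thresholded_accuracy(raw_outputs, labels, cutoff=0.5):
    # Apply the SAME cutoff to either model's raw outputs before scoring,
    # so model A and model B are compared apples to apples.
    preds = (raw_outputs > cutoff).float()
    return (preds == labels).float().mean().item()

labels = torch.tensor([1., 0., 1., 1.])
out_a = torch.tensor([0.9, 0.2, 0.6, 0.4])  # model A's raw outputs (made up)
out_b = torch.tensor([0.8, 0.4, 0.7, 0.9])  # model B's raw outputs (made up)
print(thresholded_accuracy(out_a, labels))  # 0.75
print(thresholded_accuracy(out_b, labels))  # 1.0
```

The key point is only that `cutoff` is the same for both calls; the comparison stays fair regardless of which metric you plug in.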
EDIT: The “kinked” prediction is not differentiable at the threshold itself, and it is piecewise constant everywhere else, so you should expect to see no training at all.
EDIT 2: Per the snippet below, this type of thresholding indeed zeroes out the gradients and prevents any training.
Normally you expect this:
import torch

vec1 = torch.rand(10, requires_grad=True)
vec2 = torch.rand(10, requires_grad=True)
s = vec1 + vec2
print(s)
s.sum().backward()
print(vec1.grad)
print(vec2.grad)
Output:
tensor([0.6766, 1.4995, 0.7432, 1.7373, 1.1667, 0.8818, 1.2005, 0.6173, 0.8586,
0.6945], grad_fn=<AddBackward0>)
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
With thresholding:
vec1 = torch.rand(10, requires_grad=True)
vec2 = torch.rand(10, requires_grad=True)
s = vec1 + vec2
s[s > 1.0] = 1.0  # thresholding: clamp everything above 1.0 up to 1.0
s[s < 1.0] = 0.0  # thresholding: zero out everything below 1.0
print(s)
s.sum().backward()
print(vec1.grad)
print(vec2.grad)
Output:
tensor([1., 1., 1., 1., 1., 0., 0., 1., 0., 0.], grad_fn=<IndexPutBackward>)
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
tensor([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])
Note the gradients (the last two tensors in the output) are zero everywhere, so the parameters receive no updates and can’t train: the thresholded output is piecewise constant in vec1 and vec2.
However, I suspect that even an implementation that is less kinked but still accomplishes thresholding to some degree (raising the difference from the threshold to some power, for example) would still push the gradients toward zero and slow down training.
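To illustrate (a sketch, using a steep sigmoid as one possible “soft” threshold; the steepness value 10 and the cutoff 1.0 are arbitrary choices of mine, not anything from your question): the gradients survive, but most of them become very small because the sigmoid saturates away from the cutoff.

```python
import torch

def soft_threshold(x, cutoff=1.0, steepness=10.0):
    # A smooth stand-in for hard thresholding: approaches a step
    # function as `steepness` grows, but stays differentiable everywhere.
    return torch.sigmoid(steepness * (x - cutoff))

vec1 = torch.rand(10, requires_grad=True)
vec2 = torch.rand(10, requires_grad=True)
s = soft_threshold(vec1 + vec2)
s.sum().backward()
# Gradients are nonzero, but entries far from the cutoff get a
# near-zero gradient because the sigmoid is nearly flat there.
print(vec1.grad)
print(vec2.grad)
```

Cranking `steepness` up makes the function a better approximation of the hard threshold, and correspondingly drives more of the gradients toward zero, which is the trade-off I’d expect with any soft-thresholding scheme.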