Training a model considering the predicted probability range rather than the actual label

This might be a novice question. I have a multi-class model whose final output is a linear layer with sigmoid activation. I am using the BCEWithLogitsLoss loss function to calculate the loss. For evaluation, or to get predictions, I use a probability threshold: outputs > 0.5 are marked as positive and outputs < 0.5 as negative. My question: is this 0.5 threshold already embedded in the loss calculation during training, or do I have to add a threshold condition to the model if I want to train it based on the threshold?

In short: no, you do not have to add anything about your chosen threshold to the loss function.

Your model isn’t calculating the loss based on the label that is ultimately chosen. It is calculating the loss based on the strength of the prediction with respect to the correct label.

For example, if the model assigns a probability of 0.55 to class 1 (suppose this is the true label), there is still a loss value associated with this, even though thresholding at 0.5 would give the correct label. That is because training tries to maximize the probability assigned to the correct label. So predicting 0.55 for the correct class results in a higher loss than predicting 0.95 for the same class, even though at inference time the model would be choosing the same label.
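To put rough numbers on the 0.55 vs 0.95 comparison, here is a minimal, dependency-free sketch. It assumes the standard per-element binary cross-entropy on a raw logit, which is the quantity BCEWithLogitsLoss is documented to compute before averaging. The helper names `bce_with_logits` and `logit_of` are made up for this illustration and are not PyTorch functions.

```python
import math


def bce_with_logits(logit: float, target: float) -> float:
    """Per-element binary cross-entropy on a raw logit (numerically stable form)."""
    # Equivalent to -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def logit_of(p: float) -> float:
    """Hypothetical helper: the raw logit whose sigmoid equals probability p."""
    return math.log(p / (1.0 - p))


# Both predictions land on the correct side of a 0.5 threshold (target = 1).
loss_weak = bce_with_logits(logit_of(0.55), 1.0)    # same as -log(0.55)
loss_strong = bce_with_logits(logit_of(0.95), 1.0)  # same as -log(0.95)

print(f"p=0.55 -> loss {loss_weak:.4f}")    # 0.5978
print(f"p=0.95 -> loss {loss_strong:.4f}")  # 0.0513
```

Both predictions would be marked positive at inference time, yet the weaker one is penalized roughly ten times more, so the threshold never needs to appear in the loss.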

Thanks for the reply.

That helps if I am expecting a single label from the model, since I would then take the maximum over all the output nodes and mark that one as positive. In my case, I can accept more than one label as the output for the same input.

Also, wouldn’t the model train much faster, by reducing the loss, if I could accept a wide range as correct rather than comparing against a single point?

The reason you can’t score a label as simply correct or incorrect during training (a 0-1 style loss) is that this operation is not differentiable in any useful way. For instance, if you wanted to say that either class 1 or class 3 is correct based on the thresholded outputs, there is no way to express that hard decision with a differentiable operation. Argmax isn’t differentiable, and neither is something like ‘class_1 > class_1_decision_threshold’: its gradient is zero almost everywhere, so it gives the optimizer nothing to follow. That is why we train by maximizing the likelihood of the correct class; doing so is entirely differentiable.
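A small sketch of why the thresholded version gives no training signal. It is plain Python and uses a finite-difference slope in place of autograd, purely for illustration. The functions `hard_threshold_loss` and `slope` are hypothetical names, and the 0.5 threshold is the one from the question.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def bce_loss(z: float, y: float) -> float:
    """Binary cross-entropy on logit z for target y (0 or 1)."""
    p = sigmoid(z)
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def hard_threshold_loss(z: float, y: float, threshold: float = 0.5) -> float:
    """Hypothetical 0-1 loss: 0 if the thresholded prediction matches y, else 1."""
    predicted = 1.0 if sigmoid(z) > threshold else 0.0
    return 0.0 if predicted == y else 1.0


def slope(loss_fn, z: float, y: float, eps: float = 1e-4) -> float:
    """Central finite-difference estimate of d(loss)/dz."""
    return (loss_fn(z + eps, y) - loss_fn(z - eps, y)) / (2.0 * eps)


z = -1.0  # sigmoid(z) is about 0.27, i.e. a wrong prediction for target 1
bce_slope = slope(bce_loss, z, 1.0)
hard_slope = slope(hard_threshold_loss, z, 1.0)

print(f"BCE slope:            {bce_slope:.4f}")   # nonzero: tells the logit to increase
print(f"thresholded slope:    {hard_slope:.4f}")  # flat: no direction to move in
```

The thresholded loss is flat everywhere except exactly at the decision boundary, so widening the "accepted" range only enlarges the region where the model receives no gradient at all.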