I don’t understand the difference between nn.MultiLabelSoftMarginLoss and nn.BCEWithLogitsLoss() when training a multi-label classification model.
Ideally I would like sigmoid activation applied to the final layer (instead of softmax, which assumes mutually exclusive outputs) and then to minimise the binary cross-entropy loss. So it seems to me that I should use BCEWithLogitsLoss, but then I don’t understand the point of MultiLabelSoftMarginLoss?
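To make the question concrete, here is a quick comparison I put together (the logits and targets are made-up values, just for illustration). As far as I can tell, the two criteria return the same number with their default mean reduction, which is why I’m confused about why both exist:

```python
import torch
import torch.nn as nn

# Made-up batch: 2 samples, 4 labels
logits = torch.tensor([[0.5, -1.0, 2.0, 0.1],
                       [-0.3, 0.8, -2.0, 1.5]])
targets = torch.tensor([[1., 0., 1., 1.],
                        [0., 1., 0., 1.]])

bce = nn.BCEWithLogitsLoss()          # applies sigmoid internally, then binary CE
mlsm = nn.MultiLabelSoftMarginLoss()  # also sigmoid + binary CE, averaged over classes

print(bce(logits, targets).item())
print(mlsm(logits, targets).item())
```

When I run this, both print the same value, which suggests they compute the same quantity.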
Also, in terms of accuracy, I would like to calculate exact matches: e.g. for y=(1 0 1 1), a prediction y_hat=(1 1 1 1) is scored as 0 and y_hat=(1 0 1 1) is scored as 1. Is it correct to assume that minimising the BCE loss will maximise this accuracy with a threshold of 0.5? (I wouldn’t want the loss to be minimised by assigning very high probabilities to every class: false positives should be penalised.)
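The exact-match metric I have in mind would look something like this (the function name and example values are my own, just to show what I mean by "scored as 0 or 1"):

```python
import torch

def exact_match_accuracy(logits, targets, threshold=0.5):
    # Threshold the sigmoid probabilities, then require EVERY label to match
    preds = (torch.sigmoid(logits) >= threshold).float()
    return (preds == targets).all(dim=1).float().mean().item()

# Made-up logits: first row thresholds to (1 0 1 1), second to (1 1 1 1)
logits = torch.tensor([[2.0, -1.0, 3.0, 1.0],
                       [1.0, 2.0, 0.5, 1.5]])
targets = torch.tensor([[1., 0., 1., 1.],
                        [1., 0., 1., 1.]])

print(exact_match_accuracy(logits, targets))  # 0.5: one exact match out of two
```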
This seems like maximising some kind of F2 score, because both over-predicting (1 1 1 1) and under-predicting (0 0 0 0) give an incorrect match. I guess it would be similar to reframing this as a single-label, multi-class problem where every possible label combination is its own class, and using the standard multi-class CE loss, which is minimised only when the exact prediction is made.
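The combinatorial reframing I mean is something like the sketch below (the helper name is made up, and this is just to illustrate the idea, not something I’d necessarily train): with 4 labels there are 2^4 = 16 possible label vectors, each of which becomes one class index for CrossEntropyLoss.

```python
import torch
import torch.nn as nn

num_labels = 4

def labels_to_class(y):
    # Treat each 0/1 label vector as a binary number, e.g. (1 0 1 1) -> 11
    powers = 2 ** torch.arange(num_labels - 1, -1, -1)
    return (y.long() * powers).sum(dim=1)

y = torch.tensor([[1, 0, 1, 1],
                  [0, 0, 0, 0]])
class_idx = labels_to_class(y)            # tensor([11, 0])

# The network would now need 2**num_labels = 16 outputs instead of 4
logits = torch.randn(2, 2 ** num_labels)
loss = nn.CrossEntropyLoss()(logits, class_idx)
```

Of course this blows up exponentially in the number of labels, which is presumably why the per-label sigmoid formulation is preferred.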