Label smoothing seems to be an important regularization technique now and an important component of sequence-to-sequence networks.
Implementing label smoothing is fairly simple. It requires, however, one-hot encoded labels to be passed to the cost function (smoothing changes the ones and zeros to slightly different values).
Is there any way to implement it in PyTorch? Could I maybe use some different loss function that accepts one-hot vectors, or rewrite nn.functional.cross_entropy so that the gradient can still be derived?
Or does anyone have other ideas for using label smoothing without switching to TensorFlow?
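For reference, a minimal sketch of the smoothing step itself, assuming class-index targets; the helper name, the number of classes, and the smoothing factor eps are illustrative choices, not a fixed recipe:

```python
import torch

def smooth_one_hot(targets: torch.Tensor, n_classes: int, eps: float = 0.1) -> torch.Tensor:
    # targets: LongTensor of class indices, shape [batch_size]
    # returns: FloatTensor of smoothed label distributions, shape [batch_size, n_classes]
    with torch.no_grad():
        smoothed = torch.full((targets.size(0), n_classes), eps / (n_classes - 1))
        smoothed.scatter_(1, targets.unsqueeze(1), 1.0 - eps)
    return smoothed

# class index 2 with 3 classes and eps=0.1 -> [0.05, 0.05, 0.9]
print(smooth_one_hot(torch.tensor([2]), n_classes=3))
```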
Thank you. So I should just use this function? As the prediction I should pass the softmax output, and as the label I should pass a smoothed vector like [0.05, 0.05, 0.9], is that correct?
OK, so I understand that I shouldn't use the softmax function; instead my predictions should be the raw output of the last linear layer, and I should pass a smoothed vector like [0.05, 0.05, 0.9] as the target. Moreover, it is OK to use BCELoss for multi-class classification.
I am just looking for confirmation that these ideas are correct.
You are pretty much spot on with your last comment: raw predictions go into nn.BCEWithLogitsLoss, and you can use it for multi-class classification (the target is a one-hot encoding).
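To make that concrete, a tiny sketch of this setup; the shapes and the smoothed target values are just the example from above:

```python
import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = torch.randn(1, 3)                  # raw output of the last linear layer, no softmax/sigmoid
target = torch.tensor([[0.05, 0.05, 0.9]])  # smoothed one-hot target, as in the example above

loss = criterion(logits, target)
```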
Hi, I got a bit confused here, so sorry for asking again. Do we need to do label smoothing of the target values, or is there no need anymore if we use nn.BCEWithLogitsLoss? In the last question Dawid mentioned that he should pass a smoothed vector as the target, but from your answer it seems we just need to pass the one-hot encoded (and not smoothed) target vector.
Thanks!
I think that, based on the paper, we need to convert the one-hot vector to the smoothed vector and use the original loss. But changing the loss criterion instead does the same math, I think.
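If one prefers to change the criterion rather than the targets, a sketch along these lines should give the same effect; the class name and the smoothing factor are illustrative, and note that this variant spreads eps uniformly over all classes, so the exact constants differ slightly from smoothing only the non-target entries:

```python
import torch
import torch.nn.functional as F

class LabelSmoothingCrossEntropy(torch.nn.Module):
    # Cross entropy against a smoothed target distribution (illustrative sketch).
    def __init__(self, eps: float = 0.1):
        super().__init__()
        self.eps = eps

    def forward(self, logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
        # logits: [batch_size, n_classes], target: LongTensor of class indices
        log_probs = F.log_softmax(logits, dim=-1)
        nll = -log_probs.gather(dim=-1, index=target.unsqueeze(1)).squeeze(1)  # NLL of the true class
        smooth = -log_probs.mean(dim=-1)                                       # uniform part of the smoothed target
        return ((1.0 - self.eps) * nll + self.eps * smooth).mean()
```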
I have a question here, maybe a naive one, but how does this work for multi-class problems?
I believe BCE loss and the sigmoid activation are for binary classes only. I will be grateful to learn about this in detail.
Multi-label classification use cases, where zero, one or multiple classes can be active in each sample, can use nn.BCEWithLogitsLoss as the loss function.
The model output in this case should have the shape [batch_size, nb_classes].
Multi-class classification use cases, where only a single class is active for each sample, would use nn.CrossEntropyLoss.
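A short sketch of the expected inputs for both cases; batch_size=4 and nb_classes=3 are just example values:

```python
import torch
import torch.nn as nn

output = torch.randn(4, 3)  # [batch_size, nb_classes], raw logits

# multi-label: float target of the same shape, entries are 0. or 1. (or smoothed values)
multi_label_target = torch.randint(0, 2, (4, 3)).float()
loss_ml = nn.BCEWithLogitsLoss()(output, multi_label_target)

# multi-class: LongTensor of class indices in [0, nb_classes - 1]
multi_class_target = torch.randint(0, 3, (4,))
loss_mc = nn.CrossEntropyLoss()(output, multi_class_target)
```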
You could use e.g. nn.KLDivLoss and add the weighting to the unreduced loss, or use a manual implementation of label smoothing (you should be able to find some posts about it in this forum).
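A rough sketch of the nn.KLDivLoss route, under the assumption that the targets are class indices; the function name and the smoothing factor eps are made up for illustration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def label_smoothing_kldiv(logits: torch.Tensor, target: torch.Tensor, eps: float = 0.1) -> torch.Tensor:
    # Build the smoothed target distribution from the class indices.
    n_classes = logits.size(-1)
    with torch.no_grad():
        smoothed = torch.full_like(logits, eps / (n_classes - 1))
        smoothed.scatter_(1, target.unsqueeze(1), 1.0 - eps)
    # nn.KLDivLoss expects log-probabilities as input and probabilities as target.
    log_probs = F.log_softmax(logits, dim=-1)
    return nn.KLDivLoss(reduction='batchmean')(log_probs, smoothed)

loss = label_smoothing_kldiv(torch.randn(4, 3), torch.tensor([2, 0, 1, 2]))
```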