Focal loss for imbalanced multi-class classification in PyTorch

I am looking for example code for focal loss in PyTorch for a model that predicts three classes. My model outputs 3 probabilities.

(embedding): Embedding(19612, 400)
(lstm): LSTM(400, 512, num_layers=2, batch_first=True, dropout=0.5)
(dropout): Dropout(p=0.5, inplace=False)
(fc): Linear(in_features=512, out_features=3, bias=True)
(sig): Sigmoid() )

My class distribution is highly imbalanced, so I want to try focal loss to improve accuracy on the minority classes.

I am currently using an existing focal loss implementation, but it didn’t help.

The original paper only considers binary classification. How do I extend it to the multi-class scenario?

I don’t think you would want sigmoid for multi-class (I’m assuming you mean multi-class rather than multi-label and already train with (unfocused - ha!) cross entropy loss).
If your regular cross entropy loss is “ce_loss”, you can just define alpha and gamma and do as in the linked function

ce_loss = torch.nn.functional.cross_entropy(outputs, targets, reduction='none') # important to add reduction='none' to keep per-batch-item loss
pt = torch.exp(-ce_loss)
focal_loss = (alpha * (1-pt)**gamma * ce_loss).mean() # mean over the batch
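
For anyone who wants to run the three lines above end to end, here is a minimal sketch. The function name `multiclass_focal_loss`, the random logits, the targets, and the `alpha`/`gamma` values are all made up for illustration; only the three-line recipe itself comes from the snippet above.

```python
import torch
import torch.nn.functional as F


def multiclass_focal_loss(outputs, targets, alpha=1.0, gamma=2.0):
    """Hypothetical helper wrapping the recipe above.

    outputs: raw logits of shape (batch, num_classes), no sigmoid/softmax applied
    targets: class indices of shape (batch,)
    """
    # per-sample cross entropy, shape (batch,)
    ce_loss = F.cross_entropy(outputs, targets, reduction='none')
    # probability the model assigns to the true class of each sample
    pt = torch.exp(-ce_loss)
    # down-weight well-classified samples (pt close to 1), then average
    return (alpha * (1 - pt) ** gamma * ce_loss).mean()


torch.manual_seed(0)
outputs = torch.randn(8, 3)                          # made-up batch: 8 samples, 3 classes
targets = torch.tensor([0, 0, 0, 0, 0, 1, 2, 0])     # made-up, imbalanced labels

loss = multiclass_focal_loss(outputs, targets)
plain_ce = F.cross_entropy(outputs, targets)
```

With `alpha=1`, the focal loss can never exceed the plain cross entropy, because the factor `(1 - pt) ** gamma` is at most 1. Setting `gamma=0` recovers the plain cross entropy.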

Best regards



Here is my network definition. I am not using the sigmoid layer, as cross entropy takes care of it, so I pass the raw logits to the loss function.

import torch
import torch.nn as nn

class Sentiment_LSTM(nn.Module):
    """We are training the embedding layer along with the LSTM for sentiment analysis."""

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """Set up the parameters."""
        super(Sentiment_LSTM, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        # embedding layer and LSTM layers 
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        # dropout layer to avoid overfitting
        self.dropout = nn.Dropout(0.5)
        # linear and sigmoid layers
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()

    def forward(self, x):
        """Perform a forward pass."""

        batch_size = x.size(0)

        x = x.long()
        embeds = self.embedding(x)

        lstm_out, hidden = self.lstm(embeds)

        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)

        out = self.dropout(lstm_out)
        out = self.fc(out)

        # no sigmoid here: the raw logits go straight to the cross entropy loss
        sig_out = out
        # reshape to (batch_size, seq_len, num_classes)
        sig_out = sig_out.view(batch_size, -1, 3)
        sig_out = sig_out[:, -1, :]  # keep the logits of the last time step
        # return the logits of the last time step
        return sig_out

    def init_hidden(self, batch_size):
        # initialize the hidden and cell states with zeros
        weight = next(self.parameters()).data
        if (train_on_gpu):
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_().cuda())
        else:
            hidden = (weight.new(self.n_layers, batch_size, self.hidden_dim).zero_(),
                      weight.new(self.n_layers, batch_size, self.hidden_dim).zero_())
        return hidden

My loss function:

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, logits=False, reduce=True):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.logits = logits
        self.reduce = reduce

    def forward(self, inputs, targets):
        BCE_loss = nn.CrossEntropyLoss()(inputs, targets, reduce=False)

        pt = torch.exp(-BCE_loss)
        F_loss = self.alpha * (1-pt)**self.gamma * BCE_loss

        if self.reduce:
            return torch.mean(F_loss)
        else:
            return F_loss

My output has size 3: the model has to predict whether the sentiment is positive, negative, or neutral.
My data is imbalanced: neutral has around 7000 samples, positive around 250, and negative around 800. Do my understanding and implementation make sense?

This doesn’t look right.

  • You probably just want the functional version (see above) and pass reduction='none' to be modern.
  • I’d not call it BCE_loss.
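
Putting the two points together, a module version might look roughly like the sketch below. This is one possible rewrite, not a reference implementation: the `reduction` argument mirroring PyTorch's own losses and the default values for `alpha` and `gamma` are my assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FocalLoss(nn.Module):
    """Multi-class focal loss on raw logits (illustrative sketch)."""

    def __init__(self, alpha=1.0, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha          # scalar weighting factor (assumed default)
        self.gamma = gamma          # focusing parameter (assumed default)
        self.reduction = reduction  # 'mean', 'sum', or 'none'

    def forward(self, inputs, targets):
        # inputs: (batch, num_classes) logits, targets: (batch,) class indices
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # probability of the true class
        loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        if self.reduction == 'mean':
            return loss.mean()
        if self.reduction == 'sum':
            return loss.sum()
        return loss  # per-sample losses


# usage with made-up data
torch.manual_seed(0)
logits = torch.randn(4, 3, requires_grad=True)
targets = torch.tensor([0, 2, 1, 0])
criterion = FocalLoss(gamma=2.0)
loss = criterion(logits, targets)
loss.backward()
```

The per-sample cross entropy is named `ce_loss` rather than `BCE_loss` because nothing binary is involved here.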

My impression is that focal loss may help, but there are quite a few ways to handle imbalance: the simplest is balanced sampling during training, and a more recent one is a weighted loss function.
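
As a rough illustration of balanced sampling, one option in PyTorch is `torch.utils.data.WeightedRandomSampler`. The sketch below uses a made-up toy dataset (random features, class sizes scaled down from the numbers in the question), and inverse-frequency sample weights are just one possible choice.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

torch.manual_seed(0)

# made-up toy data: 700 / 25 / 80 samples for classes 0 / 1 / 2
labels = torch.tensor([0] * 700 + [1] * 25 + [2] * 80)
features = torch.randn(len(labels), 10)
dataset = TensorDataset(features, labels)

# weight each sample by the inverse frequency of its class,
# so every class is drawn about equally often
class_counts = torch.bincount(labels, minlength=3).float()
sample_weights = (1.0 / class_counts)[labels]

sampler = WeightedRandomSampler(sample_weights,
                                num_samples=len(labels),
                                replacement=True)
# do not pass shuffle=True together with a sampler
loader = DataLoader(dataset, batch_size=64, sampler=sampler)
```

Because sampling is done with replacement, minority-class samples are repeated many times per epoch, so it is worth watching for overfitting on those classes.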

Best regards



Can you answer this as well?

What’s the value of alpha here?

alpha is an additional weighting factor between classes. In the paper linked above it is introduced in eq (5).
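
If you want a different alpha for each class rather than a single scalar, a sketch could look like the one below. The helper name `focal_loss_with_class_alpha`, the class-index order (0 = neutral, 1 = positive, 2 = negative), and the inverse-frequency formula for alpha are all assumptions for illustration; the paper treats alpha as a hyperparameter to tune.

```python
import torch
import torch.nn.functional as F


def focal_loss_with_class_alpha(logits, targets, alpha, gamma=2.0):
    """Hypothetical helper: alpha is a 1-D tensor with one weight per class."""
    # Keep cross_entropy unweighted here, so that exp(-ce_loss) is still
    # the probability of the true class.
    ce_loss = F.cross_entropy(logits, targets, reduction='none')
    pt = torch.exp(-ce_loss)
    alpha_t = alpha.to(logits.device)[targets]  # weight of each sample's true class
    return (alpha_t * (1 - pt) ** gamma * ce_loss).mean()


# approximate class counts from the question; index order is assumed
counts = torch.tensor([7000.0, 250.0, 800.0])
# inverse-frequency weights (one heuristic among several)
alpha = counts.sum() / (len(counts) * counts)

torch.manual_seed(0)
logits = torch.randn(6, 3)                   # made-up logits
targets = torch.tensor([0, 0, 1, 2, 0, 1])   # made-up labels
loss = focal_loss_with_class_alpha(logits, targets, alpha)
```

The rarest class gets the largest weight under this scheme, so it is worth checking that the resulting loss does not become dominated by a handful of minority samples.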

Best regards


Why do we need the line below in focal loss? According to the paper, pt = p if y == 1, otherwise 1 - p.

pt = torch.exp(-ce_loss)

Why torch.exp? What does it represent here?

The ce_loss is a negative log-likelihood, so torch.exp(-ce_loss) is the likelihood of the true class (i.e. a value between 0 and 1).

I got it, but why do we need torch.exp here?
Could you please clarify it with a simple example?

I’m not sure I understand the question. torch.exp of a log-likelihood gives you the likelihood, because exp is the inverse operation of log.
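
A small numeric check may make this concrete. The logits below are made up; the point is only that `torch.exp(-ce_loss)` gives back the softmax probability of the true class, which is the multi-class counterpart of the paper's "p if y == 1, otherwise 1 - p".

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # made-up logits for a single sample
target = torch.tensor([0])                 # true class is index 0

probs = F.softmax(logits, dim=1)           # class probabilities
# cross entropy is -log(probability of the true class)
ce_loss = F.cross_entropy(logits, target, reduction='none')
# exp undoes the log, so pt is the probability of the true class again
pt = torch.exp(-ce_loss)

print(probs[0, 0].item(), pt.item())       # the two values match (about 0.79)
```

So the exp is not an extra modelling step. It is just a convenient way to get pt from a loss value you have already computed.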

alpha is a really hard hyperparameter to tune...