Focal loss for imbalanced multi-class classification in PyTorch

VikasRajashekar · November 17, 2019, 7:40pm

I'm looking for example code for focal loss in PyTorch for a model that predicts three classes. My model outputs 3 probabilities.

Sentiment_LSTM(
  (embedding): Embedding(19612, 400)
  (lstm): LSTM(400, 512, num_layers=2, batch_first=True, dropout=0.5)
  (dropout): Dropout(p=0.5, inplace=False)
  (fc): Linear(in_features=512, out_features=3, bias=True)
  (sig): Sigmoid()
)

My class distribution is highly imbalanced, so I want to try focal loss to improve accuracy on the minority classes.

I am currently using the loss function defined in https://www.kaggle.com/c/tgs-salt-identification-challenge/discussion/65938, but it didn’t help.

The original paper (https://arxiv.org/abs/1708.02002) only considers binary classification. How do I extend it to the multi-class scenario?

tom · November 18, 2019, 11:47am

I don’t think you would want sigmoid for multi-class (I’m assuming you mean multi-class rather than multi-label and already train with (unfocused - ha!) cross entropy loss).
If your regular cross entropy loss is “ce_loss”, you can just define alpha and gamma and do as in the linked function:

ce_loss = torch.nn.functional.cross_entropy(outputs, targets, reduction='none') # important to add reduction='none' to keep per-batch-item loss
pt = torch.exp(-ce_loss)
focal_loss = (alpha * (1-pt)**gamma * ce_loss).mean() # mean over the batch
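
For anyone who wants to run this directly, here is a minimal self-contained sketch of the snippet above. The random logits, the labels, and the values alpha = 1.0 and gamma = 2.0 are assumptions made up for illustration, not values from this thread.

```python
import torch
import torch.nn.functional as F

def focal_loss(outputs, targets, alpha=1.0, gamma=2.0):
    # outputs: raw logits of shape (batch, num_classes)
    # targets: class indices of shape (batch,)
    ce_loss = F.cross_entropy(outputs, targets, reduction='none')  # per-item loss
    pt = torch.exp(-ce_loss)  # probability the model assigns to the true class
    return (alpha * (1 - pt) ** gamma * ce_loss).mean()

torch.manual_seed(0)
outputs = torch.randn(8, 3)          # hypothetical logits: 8 items, 3 classes
targets = torch.randint(0, 3, (8,))  # hypothetical labels

loss = focal_loss(outputs, targets)
plain_ce = F.cross_entropy(outputs, targets)
loss_gamma0 = focal_loss(outputs, targets, gamma=0.0)  # gamma = 0 switches the focusing off
```

With alpha = 1 and gamma = 0 the result should match the ordinary cross entropy, which is a quick sanity check before tuning gamma.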

Best regards

Thomas

VikasRajashekar · November 18, 2019, 12:05pm

Here is my network definition. I am not using the sigmoid layer, as cross entropy takes care of it, so I pass the raw logits to the loss function.

import torch.nn as nn

class Sentiment_LSTM(nn.Module):
    """
    Trains the embedding layer together with the LSTM for sentiment analysis
    """

    def __init__(self, vocab_size, output_size, embedding_dim, hidden_dim, n_layers, drop_prob=0.5):
        """
        Set up the layers and hyperparameters.
        """
        super(Sentiment_LSTM, self).__init__()

        self.output_size = output_size
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        
        # embedding layer and LSTM layers 
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, n_layers, 
                            dropout=drop_prob, batch_first=True)
        
        # dropout layer to avoid overfitting
        self.dropout = nn.Dropout(0.5)
        
        # linear layer; the sigmoid is defined but not applied in forward(),
        # because the cross entropy loss expects raw logits
        self.fc = nn.Linear(hidden_dim, output_size)
        self.sig = nn.Sigmoid()
        

    def forward(self, x):
        """
        Perform a forward pass

        """
        batch_size = x.size(0)

        x = x.long()
        embeds = self.embedding(x)

        lstm_out, hidden = self.lstm(embeds)

    
        # stack up lstm outputs
        lstm_out = lstm_out.contiguous().view(-1, self.hidden_dim)


        out = self.dropout(lstm_out)
        out = self.fc(out)

        # no sigmoid here: keep the raw logits for the cross entropy loss
        sig_out = out
        
        # reshape to (batch_size, seq_len, output_size)
        sig_out = sig_out.view(batch_size, -1, self.output_size)
        #print("sig_out",sig_out.shape)
        sig_out = sig_out[:, -1,:] # keep only the last time step
        
        # return the logits of the last time step
        return sig_out
    
    
    def init_hidden(self, batch_size):
        # initialize the hidden and cell states
        weight = next(self.parameters()).data
        
        # allocate the zero states with the same device and dtype as the model weights,
        # so no separate GPU flag is needed
        hidden = (weight.new_zeros((self.n_layers, batch_size, self.hidden_dim)),
                  weight.new_zeros((self.n_layers, batch_size, self.hidden_dim)))
        
        return hidden

My loss function:

class FocalLoss(nn.Module):
    def __init__(self, alpha=1, gamma=2, logits=False, reduce=True):
        super(FocalLoss, self).__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.logits = logits
        self.reduce = reduce

    def forward(self, inputs, targets):nn.CrossEntropyLoss()
    
        BCE_loss = nn.CrossEntropyLoss()(inputs, targets, reduce=False)

        pt = torch.exp(-BCE_loss)
        F_loss = self.alpha * (1-pt)**self.gamma * BCE_loss

        if self.reduce:
            return torch.mean(F_loss)
        else:
            return F_loss

My output has size 3: it has to predict whether the sentiment is positive, negative, or neutral.
My data is imbalanced: neutral has around 7000 samples, positive around 250, and negative around 800. Do my understanding and the implementation make sense?

tom · November 18, 2019, 9:15pm

This doesn’t look right.

You probably just want the functional version (see above) and pass reduction='none' to be modern.
I’d not call it BCE_loss.
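
A minimal sketch of what the module could look like with those two changes applied (functional cross entropy with reduction='none', and a neutral variable name). The `reduction` argument of the class is an assumption of this sketch, standing in for the `logits` and `reduce` flags of the posted code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalLoss(nn.Module):
    def __init__(self, alpha=1.0, gamma=2.0, reduction='mean'):
        super().__init__()
        self.alpha = alpha
        self.gamma = gamma
        self.reduction = reduction  # 'mean' or 'none' (options assumed for this sketch)

    def forward(self, inputs, targets):
        # inputs: raw logits (batch, num_classes); targets: class indices (batch,)
        ce_loss = F.cross_entropy(inputs, targets, reduction='none')
        pt = torch.exp(-ce_loss)  # probability of the true class
        f_loss = self.alpha * (1 - pt) ** self.gamma * ce_loss
        return f_loss.mean() if self.reduction == 'mean' else f_loss

criterion = FocalLoss()
logits = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]], requires_grad=True)
labels = torch.tensor([0, 2])
loss = criterion(logits, labels)
loss.backward()  # gradients flow back to the logits as with any other loss
```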

My impression is that focal loss may help, but there are quite a few ways to do this. The simplest one is balanced sampling during training, and a recent one is the weighted loss function from “Class-Balanced Loss Based on Effective Number of Samples” (arXiv:1901.05555).
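
As a rough sketch of the class-balanced weighting idea: each class gets a weight inversely proportional to its effective number of samples, (1 - beta^n) / (1 - beta). The counts are the approximate ones given earlier in this thread; beta = 0.999 and the rescaling so that the weights sum to the number of classes are assumptions chosen for illustration.

```python
counts = {"neutral": 7000, "positive": 250, "negative": 800}  # approximate counts from this thread
beta = 0.999  # assumed value; beta is a hyperparameter to tune

# Effective number of samples per class: (1 - beta**n) / (1 - beta)
effective_num = {c: (1 - beta ** n) / (1 - beta) for c, n in counts.items()}

# Weight is inversely proportional to the effective number,
# rescaled here so that the weights sum to the number of classes.
raw = {c: 1.0 / e for c, e in effective_num.items()}
scale = len(counts) / sum(raw.values())
class_weights = {c: w * scale for c, w in raw.items()}
```

The resulting values, arranged in class-index order as a tensor, could then be passed as the `weight` argument of the cross entropy loss.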

Best regards

Thomas

Umair_Javaid · May 22, 2020, 3:50am

Can you answer this as well?

Umair_Javaid · May 25, 2020, 2:36am

What’s the value of alpha here?

tom · May 29, 2020, 6:13am

alpha is an additional weighting factor between classes. In the paper linked above it is introduced in eq (5).
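
If alpha is meant to weight classes differently, one way to sketch it is to store one value per class and pick the entry belonging to each item's target. The function name and the alpha values below are made up for illustration.

```python
import torch
import torch.nn.functional as F

def focal_loss_per_class_alpha(outputs, targets, alpha, gamma=2.0):
    # alpha: tensor of shape (num_classes,) holding one weight per class
    ce_loss = F.cross_entropy(outputs, targets, reduction='none')
    pt = torch.exp(-ce_loss)
    alpha_t = alpha[targets]  # weight of each item's true class, shape (batch,)
    return (alpha_t * (1 - pt) ** gamma * ce_loss).mean()

alpha = torch.tensor([0.2, 1.0, 0.6])  # hypothetical per-class weights
outputs = torch.tensor([[1.0, 0.0, -1.0], [0.0, 2.0, 0.5], [0.3, 0.3, 0.3]])
targets = torch.tensor([0, 1, 2])
loss = focal_loss_per_class_alpha(outputs, targets, alpha)

# With every alpha entry equal to 1 this reduces to the scalar alpha = 1 case.
unweighted = focal_loss_per_class_alpha(outputs, targets, torch.ones(3))
```

A single scalar alpha, as in the snippet earlier in the thread, only rescales the whole loss; a per-class vector is what changes the balance between classes.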

Best regards

Thomas

shakeel608 · June 2, 2021, 9:24am

Why do we need the line below in focal loss? As per the paper, pt = p if y == 1, otherwise 1 - p.

torch.exp(-ce_loss)

Why torch.exp? What does it represent here?

tom · June 2, 2021, 9:48am

The ce_loss is a negative log likelihood and so torch.exp(-ce_loss) is the likelihood (i.e. between 0 and 1 etc.).

shakeel608 · June 2, 2021, 10:00am

I got it,
but why do we need torch.exp here?
Could you please clarify it with a simple example?

tom · June 2, 2021, 10:23am

I’m not sure I understand? torch.exp of a log likelihood gives you the likelihood because exp is the inverse operation of log.
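
A small numeric sketch of that relationship for a single sample, in plain Python. The logits and the target index are made-up values.

```python
import math

logits = [2.0, 0.5, -1.0]  # made-up logits for one sample
target = 0                 # assumed index of the true class

# Softmax turns the logits into probabilities.
exps = [math.exp(z) for z in logits]
probs = [e / sum(exps) for e in exps]

p_t = probs[target]        # probability of the true class
ce = -math.log(p_t)        # cross entropy for this sample
recovered = math.exp(-ce)  # exp undoes the log, so this is p_t again
```

So torch.exp(-ce_loss) is just a convenient way to get the true-class probability back from a loss value that has already been computed.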

XavierMFC · May 6, 2022, 8:58am

alpha is really a hard hyperparameter to tune...

Xinchengzelin · May 9, 2023, 12:20pm

alpha here can’t be the balancing factor from the paper; it’s just a scaling factor, right?

MaxWolf-01 · March 7, 2024, 7:45pm

Would it be correct to use alpha already in the cross_entropy calculation as weight like this?

ce_loss = torch.nn.functional.cross_entropy(outputs, targets, reduction='none', weight=alpha) 
pt = torch.exp(-ce_loss)
focal_loss = ((1-pt)**gamma * ce_loss).mean()

Shix0 · April 6, 2024, 12:50am

@MaxWolf-01 I was wondering about the same idea. I’ve tried that, and I am getting unstable loss values (they keep fluctuating between extremes). Without class weights (i.e., weight=None), the loss values are stable, but in that case focal loss overfits compared to nn.CrossEntropyLoss with class weights.

@tom Any ideas here?

Here is the implementation:

class FocalLoss(torch.nn.Module):
    """Implementation of the Focal loss function

    Args:
        weight: class weight vector to be used in case of class imbalance
        gamma: hyper-parameter for the focal loss scaling.
    """
    def __init__(self, weight=None, gamma=2):
        super(FocalLoss, self).__init__()
        self.gamma = gamma
        self.weight = weight #weight parameter will act as the alpha parameter to balance class weights

    def forward(self, outputs, targets):
        ce_loss = torch.nn.functional.cross_entropy(outputs, targets, reduction='none', weight=self.weight) 
        pt = torch.exp(-ce_loss)
        focal_loss = ((1-pt)**self.gamma * ce_loss).mean() # mean over the batch
        return focal_loss
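
One thing that may be worth checking, offered as a possible factor rather than a confirmed cause of the instability: when weight is passed to cross_entropy with reduction='none', each item's loss is multiplied by the weight of its target class, so torch.exp(-ce_loss) is the true-class probability raised to that weight rather than the probability itself. Below is a sketch of a variant that computes pt from the unweighted loss and applies the class weight afterwards. The function name and all numbers are made up for illustration.

```python
import torch
import torch.nn.functional as F

def focal_loss_weighted(outputs, targets, weight=None, gamma=2.0):
    ce_loss = F.cross_entropy(outputs, targets, reduction='none')  # unweighted
    pt = torch.exp(-ce_loss)               # true-class probability
    loss = (1 - pt) ** gamma * ce_loss
    if weight is not None:
        loss = weight[targets] * loss      # class weight applied after the focal term
    return loss.mean()

outputs = torch.tensor([[2.0, 0.5, -1.0], [0.1, 0.2, 0.3]])
targets = torch.tensor([0, 2])
weight = torch.tensor([0.5, 4.0, 2.0])     # hypothetical class weights

p_true = F.softmax(outputs, dim=1)[torch.arange(2), targets]
weighted_ce = F.cross_entropy(outputs, targets, reduction='none', weight=weight)
pt_from_weighted = torch.exp(-weighted_ce)  # equals p_true ** weight[targets]
loss = focal_loss_weighted(outputs, targets, weight)
```

Note that this sketch averages over the batch size, whereas cross_entropy with reduction='mean' and class weights divides by the sum of the weights of the targets, so the two are scaled differently.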