Binary classification task on imbalanced dataset: loss not decreasing

Hi all!
I’m trying to implement a simple neural network for a binary classification problem on an imbalanced dataset (21% positive vs 79% negative).
I can’t find any configuration (hyperparameters or model architecture) that makes the training/validation loss decrease: the loss is highly unstable around 0.6 (close to log(2), I assume), so I’m wondering what the problem could be. During a dataset cleaning process I reduced the feature dimensionality from 45 to 8; however, the correlation of each feature with the label is only between 0.13 and 0.21, which worries me.

To address the imbalance I oversample the positive data during training so that each batch has a 50-50 positive/negative ratio.
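For reference, balanced-batch oversampling like this can be sketched with PyTorch’s `WeightedRandomSampler`; the toy data, dataset size, and batch size below are placeholders, not the poster’s actual setup:

```python
import torch
from torch.utils.data import TensorDataset, DataLoader, WeightedRandomSampler

torch.manual_seed(0)

# Toy data with roughly the class ratio from the post: ~21% positive.
X = torch.randn(1000, 8)
y = (torch.rand(1000) < 0.21).long()

# Weight each sample inversely to its class frequency, so sampled
# batches are ~50-50 positive/negative in expectation.
class_counts = torch.bincount(y, minlength=2).float()
sample_weights = (1.0 / class_counts)[y]

sampler = WeightedRandomSampler(sample_weights, num_samples=len(y), replacement=True)
loader = DataLoader(TensorDataset(X, y), batch_size=20, sampler=sampler)

xb, yb = next(iter(loader))
print(xb.shape, yb.float().mean())  # one batch of 20; label mean near 0.5
```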

Currently, this is my model architecture:

import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # Input to hidden layer linear transformations
        self.fc1 = nn.Linear(in_ch, 16)
        self.bn1 = nn.BatchNorm1d(16)
        self.fc2 = nn.Linear(16, 32)
        self.bn2 = nn.BatchNorm1d(32)
        self.fc3 = nn.Linear(32, 64)
        self.bn3 = nn.BatchNorm1d(64)
        self.fc4 = nn.Linear(64, 32)
        self.fc5 = nn.Linear(32, out_ch)

    def forward(self, x):
        x = F.leaky_relu(self.bn1(self.fc1(x)))
        x = F.leaky_relu(self.bn2(self.fc2(x)))
        x = F.leaky_relu(self.bn3(self.fc3(x)))
        x = F.leaky_relu(self.fc4(x))
        return self.fc5(x)  # raw logits, no activation on the last layer

I’ve already tried different activation functions (ReLU and tanh) and also removing batch normalization.
I use a plain SGD optimizer.
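For completeness, a minimal training-step sketch of this setup (the model is repeated here so the snippet is self-contained; the learning rate, momentum, and random batch are placeholders, not the poster’s settings):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Network(nn.Module):
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.fc1 = nn.Linear(in_ch, 16)
        self.bn1 = nn.BatchNorm1d(16)
        self.fc2 = nn.Linear(16, 32)
        self.bn2 = nn.BatchNorm1d(32)
        self.fc3 = nn.Linear(32, 64)
        self.bn3 = nn.BatchNorm1d(64)
        self.fc4 = nn.Linear(64, 32)
        self.fc5 = nn.Linear(32, out_ch)

    def forward(self, x):
        x = F.leaky_relu(self.bn1(self.fc1(x)))
        x = F.leaky_relu(self.bn2(self.fc2(x)))
        x = F.leaky_relu(self.bn3(self.fc3(x)))
        x = F.leaky_relu(self.fc4(x))
        return self.fc5(x)  # raw logits: CrossEntropyLoss applies log-softmax itself

model = Network(in_ch=8, out_ch=2)   # 8 features, 2 classes as described in the post
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

xb = torch.randn(20, 8)              # one batch: 20 samples, 8 features
yb = torch.randint(0, 2, (20,))      # class indices 0/1

optimizer.zero_grad()
logits = model(xb)
loss = criterion(logits, yb)
loss.backward()
optimizer.step()
print(loss.item())
```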

Do you have any suggestion on what could be the problem?

I don’t know your batch size, learning rate, or dataset size, but you use a leaky ReLU activation function in the last layer. It’s appropriate to use a sigmoid activation function in the last layer for binary classification. I also think your model can’t learn the features because the network is too weak, so I’d suggest enlarging your linear layers, i.e. increasing the hidden units. If you use a small batch size, you can remove the batch norm layers. Finally, you can normalize the input features to speed up gradient descent.

Hi @Erhan, thank you for your reply. I was using a batch size of 20 (10 positive + 10 negative samples). Moreover, the input features are already min-max normalised, and I switched from sigmoid to leaky ReLU because I’m using the CrossEntropyLoss criterion with two classes rather than BCE: do you think this could be a problem?
I will try enlarging the hidden units. Thank you, I’d really appreciate it if you could tell me more about the previous points.

The batch size looks normal. CrossEntropyLoss combines LogSoftmax and NLLLoss, which means a softmax is applied to the last layer. Softmax is used for multi-class classification problems, so for binary classification it’s more appropriate to use a BCE loss rather than CrossEntropyLoss. Train your neural network longer with this configuration and try different learning rates.
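The suggested switch to a BCE loss could look like the sketch below, using `BCEWithLogitsLoss` (which fuses sigmoid and BCE and is more numerically stable than sigmoid + `nn.BCELoss`) with a single output logit; shapes and values here are illustrative only:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for model output with out_ch=1: one logit per sample.
logits = torch.randn(20, 1)
targets = torch.randint(0, 2, (20, 1)).float()  # BCE targets must be float

criterion = nn.BCEWithLogitsLoss()
loss = criterion(logits, targets)

# At evaluation time, apply the sigmoid explicitly to get probabilities.
probs = torch.sigmoid(logits)
preds = (probs > 0.5).long()
print(loss.item(), preds.shape)
```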