Loss not converging on a binary classification problem

Hi there,

I’m trying to implement a very simple model (a multi-layer perceptron) for a binary classification problem, but the loss does not decrease and has a saw-tooth shape.

I have very few labelled samples (60 train | 15 test). The input is tabular data with 7 features (of different types), which I then min-max normalize.

Each sample belongs to class 0 or 1.
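For context, the normalization is a standard per-feature min-max scaling, along these lines (a minimal sketch, not my exact preprocessing code):

import torch

def min_max_normalize(x: torch.Tensor) -> torch.Tensor:
    # Scale each feature (column) to [0, 1]: (x - min) / (max - min)
    x_min = x.min(dim=0, keepdim=True).values
    x_max = x.max(dim=0, keepdim=True).values
    return (x - x_min) / (x_max - x_min + 1e-8)  # eps guards against constant columns

# e.g. 60 training samples with 7 features each
features = torch.rand(60, 7)
normalized = min_max_normalize(features)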

The model:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CustomModel(nn.Module):
    def __init__(self):
        torch.manual_seed(3)
        super(CustomModel, self).__init__()
        self.fc1 = nn.Linear(7, 60)
        self.fc2 = nn.Linear(60, 100)
        self.fc3 = nn.Linear(100, 30)
        self.fc4 = nn.Linear(30, 2)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight.data)
            torch.nn.init.constant_(module.bias.data, 0)

    def forward(self, data):
        x = data
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x

The training process:

    # model instance
    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = CustomModel().to(device)  # move the model to the same device as the data

    # Optimizer and criterion
    optimizer = optim.SGD(model.parameters(), lr=0.0001)
    criterion = nn.CrossEntropyLoss()

    model.train(True)
    for epoch in range(1000):
        for batch in train_loader:
            sample, ground = batch
            sample = sample.to(device=device, dtype=torch.float32)
            ground = ground.to(device=device, dtype=torch.long)

            optimizer.zero_grad()

            prediction = model(sample)
            loss = criterion(prediction, ground)
            loss.backward()
            optimizer.step()

And the obtained loss curve:

[image: training loss curve]

I don't know where the problem might be: the data processing, the model definition, or the training process itself. Batch size is set to 8. I have checked that the model parameters are updating (only slightly, though).
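For reference, I checked the updates with something along these lines (a minimal sketch around a single step):

# Snapshot the parameters, take one optimization step, and compare
before = [p.detach().clone() for p in model.parameters()]
optimizer.zero_grad()
loss = criterion(model(sample), ground)
loss.backward()
optimizer.step()
for idx, (old, new) in enumerate(zip(before, model.parameters())):
    print(f"param {idx}: max abs change = {(new.detach() - old).abs().max().item():.2e}")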

Looking at the gradients of the last layer, at some point a few values are exactly 0.0, so I don't know if this could be the problem.

print(list(model.parameters())[6].grad)

tensor([[ 0.0032, -0.0025,  0.0000,  0.0000,  0.0025, -0.0045,  0.0214,  0.0219,
          0.0102,  0.0052, -0.0004, -0.0027,  0.0000, -0.0109,  0.0008,  0.0020,
          0.0000,  0.0024,  0.0000,  0.0161,  0.0000,  0.0041,  0.0000, -0.0058,
          0.0078, -0.0006,  0.0000,  0.0060,  0.0049,  0.0027],
        [-0.0032,  0.0025,  0.0000,  0.0000, -0.0025,  0.0045, -0.0214, -0.0219,
         -0.0102, -0.0052,  0.0004,  0.0027,  0.0000,  0.0109, -0.0008, -0.0020,
          0.0000, -0.0024,  0.0000, -0.0161,  0.0000, -0.0041,  0.0000,  0.0058,
         -0.0078,  0.0006,  0.0000, -0.0060, -0.0049, -0.0027]])
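The zeros line up column-wise in both rows, so I suspect the corresponding fc3 ReLU units output zero for the whole batch. A quick diagnostic I could run (a sketch, not part of my training code):

# Fraction of exactly-zero entries in each parameter's gradient
for name, p in model.named_parameters():
    if p.grad is not None:
        zero_frac = (p.grad == 0).float().mean().item()
        print(f"{name}: {zero_frac:.1%} zero gradient entries")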

Thanks in advance.

Your model is able to overfit random samples:

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class CustomModel(nn.Module):
    def __init__(self):
        torch.manual_seed(3)
        super(CustomModel, self).__init__()
        self.fc1 = nn.Linear(7, 60)
        self.fc2 = nn.Linear(60, 100)
        self.fc3 = nn.Linear(100, 30)
        self.fc4 = nn.Linear(30, 2)
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.xavier_uniform_(module.weight.data)
            torch.nn.init.constant_(module.bias.data, 0)

    def forward(self, data):
        x = data
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = F.relu(self.fc3(x))
        x = self.fc4(x)
        return x
    
model = CustomModel()
optimizer = optim.SGD(model.parameters(), lr=0.0001)
criterion = nn.CrossEntropyLoss()

x = torch.randn(64, 7)
target = torch.randint(0, 2, (64,))

for epoch in range(100000):
    optimizer.zero_grad()
    out = model(x)
    loss = criterion(out, target)
    loss.backward()
    optimizer.step()
    print(f"epoc: {epoch}, loss: {loss:.5f}")

but it is quite slow, as the learning rate seems to be low.
I would also expect this behavior, since you have a total of 60 * 7 = 420 input values with:

sum([p.nelement() for p in model.parameters()])
# 9672

trainable parameters.
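For reference, the count breaks down per layer like this:

for name, p in model.named_parameters():
    print(f"{name}: {p.nelement()}")
# fc1: 7*60 weights + 60 biases    =  480
# fc2: 60*100 weights + 100 biases = 6100
# fc3: 100*30 weights + 30 biases  = 3030
# fc4: 30*2 weights + 2 biases     =   62
# total                            = 9672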
You could use my code and should see a constant decrease in the loss value and eventually 100% accuracy on this random training set.
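A minimal sketch of how the accuracy on this random set could be checked:

# Compare the argmax of the logits against the random targets
with torch.no_grad():
    preds = model(x).argmax(dim=1)
    acc = (preds == target).float().mean().item()
print(f"train accuracy: {acc:.1%}")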
With that being said, the number of samples is quite low, and even if you can overfit the training set, it might be tricky to generalize to the validation/test set.

Thanks @ptrblck!
It was indeed a matter of iterations (now 80000) for the loss to start to converge:

I don't know yet why the loss has this “spike” shape, but the overall trend is decreasing, and the train/test curves make sense with the model overfitting the training data.

[image: train/test loss curves]