Predictions Do Not Get Better Over Epochs - Loss Constant From First Epoch On

I am trying to train a NN to predict a made-up function (binary classification problem). In particular, the input features are 12 binary features and the label corresponds to one whenever the second, fifth, or sixth input features are one.
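For anyone who wants to reproduce the setup, here is a minimal sketch of how such a dataset could be generated. The helper name `make_dataset`, the sample count, and the uniform random sampling are assumptions; the post does not say how the data was actually built.

```python
import torch

def make_dataset(n_samples: int = 1024, seed: int = 0):
    """Random 12-bit inputs; label is 1 when feature 2, 5, or 6 (1-based) is set."""
    gen = torch.Generator().manual_seed(seed)
    # Each of the 12 features is drawn independently from {0, 1}.
    X = torch.randint(0, 2, (n_samples, 12), generator=gen).float()
    # 1-based positions 2, 5, 6 correspond to 0-based columns 1, 4, 5.
    # The row-wise max over those columns is a logical OR for 0/1 values.
    y = X[:, [1, 4, 5]].amax(dim=1, keepdim=True)
    return X, y

X, y = make_dataset()
print(X.shape, y.shape, y.mean().item())
```

With uniform sampling roughly 7 out of 8 labels are positive (1 - 0.5^3 = 0.875), so the classes are imbalanced, which is worth keeping in mind when reading accuracy numbers.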

The NN structure looks like this:

from torch import nn

class NeuralNework(nn.Module):
    def __init__(self):
        super(NeuralNework, self).__init__()
        #self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(12, 12),
            nn.ReLU(),
            nn.Linear(12, 12),
            nn.ReLU(),
            nn.Linear(12, 1),
            nn.ReLU(),
        )

    def forward(self, x):
        #x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNework()

while the train loop is:

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X.float())
        # Reshape to (N, 1) so the target matches pred, even for a smaller last batch
        y = y.float().reshape(-1, 1)
        loss = loss_fn(pred, y)

        # Backpropagation

As the loss function I use BCEWithLogitsLoss(), and I tried both the SGD and the Adam optimizers. I believe there might be a problem with how gradients are maintained between epochs (they might be deleted, which would cause performance to stay the same in every epoch); however, I was not able to figure out what causes the issue.
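On the question of gradients between epochs: in PyTorch the learned weights persist across epochs, while the `.grad` buffers accumulate until they are explicitly cleared. The backpropagation step is cut off in the post, so the following is only a sketch of the usual pattern, not the original code. The toy data, the small model without a final activation, the learning rate, and the epoch count are all assumptions made for illustration.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def train_loop(dataloader, model, loss_fn, optimizer):
    """Run one epoch and return the mean loss per sample."""
    model.train()
    total_loss = 0.0
    for X, y in dataloader:
        # Compute prediction and loss
        pred = model(X.float())
        target = y.float().reshape(-1, 1)  # same (N, 1) shape as pred
        loss = loss_fn(pred, target)

        # Backpropagation: clear old gradients, compute new ones, update weights
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        total_loss += loss.item() * X.shape[0]
    return total_loss / len(dataloader.dataset)

torch.manual_seed(0)
X = torch.randint(0, 2, (512, 12)).float()
y = X[:, [1, 4, 5]].amax(dim=1)  # 1 if feature 2, 5, or 6 (1-based) is set
loader = DataLoader(TensorDataset(X, y), batch_size=64, shuffle=True)

# Assumed toy model: the last layer emits raw logits for BCEWithLogitsLoss
model = nn.Sequential(nn.Linear(12, 12), nn.ReLU(), nn.Linear(12, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = [train_loop(loader, model, loss_fn, optimizer) for _ in range(20)]
print(f"first epoch: {losses[0]:.4f}, last epoch: {losses[-1]:.4f}")
```

If the per-epoch loss from a loop like this does not move at all, the optimizer and the loop are unlikely to be the cause, and the model definition is the next place to look.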

The current BCE Loss is 0.693147 and I cannot believe the network could not perform better since the problem is so simple.

Any help is greatly appreciated
Thank you very much in advance

A loss of 0.693 is ln 2, the binary cross-entropy you get when the model predicts a probability of 0.5 for every sample. In other words, the model is doing no better than a coin flip, which is consistent with it not learning anything after the random initialization. I don't see anything obvious that indicates why the model cannot learn more than random chance. Can you share some other parts of the code, such as where the optimizer is defined?
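The coin-flip value is easy to check numerically. This is a small sketch with made-up targets; only `BCEWithLogitsLoss` comes from the original post.

```python
import math
import torch
from torch import nn

loss_fn = nn.BCEWithLogitsLoss()

# A logit of 0 maps to sigmoid(0) = 0.5, i.e. a coin flip for every sample.
logits = torch.zeros(8, 1)
# Arbitrary example labels; the result is the same for any mix of 0s and 1s.
targets = torch.tensor([[1.0], [0.0], [1.0], [1.0], [0.0], [1.0], [1.0], [1.0]])

chance_loss = loss_fn(logits, targets).item()
print(f"{chance_loss:.6f} vs ln 2 = {math.log(2):.6f}")
```

Because each sample contributes -ln(0.5) regardless of its label, the mean is ln 2 for any label distribution.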

One issue could be the last nn.ReLU, which would clip the logits to [0, +Inf) and could thus hamper training, so you might want to remove it.
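To illustrate why a trailing ReLU can pin the loss at exactly ln 2, here is a small sketch with hand-picked values. The tensor `raw` stands in for the output of the last linear layer and is an assumption, not data from the post.

```python
import math
import torch
from torch import nn

# Hypothetical pre-activation outputs of the last linear layer, all negative.
raw = torch.tensor([[-3.0], [-0.5], [-1.2], [-2.0]], requires_grad=True)
targets = torch.tensor([[1.0], [0.0], [1.0], [1.0]])

logits = torch.relu(raw)        # every negative value is clipped to 0
probs = torch.sigmoid(logits)   # sigmoid(0) = 0.5 for all samples

loss = nn.BCEWithLogitsLoss()(logits, targets)
loss.backward()

print(probs.flatten().tolist())
print(loss.item(), math.log(2))
print(raw.grad.flatten().tolist())  # ReLU passes no gradient for negative inputs
```

Two things follow from this. Since ReLU outputs are never negative, the predicted probability can never drop below 0.5, so class 0 cannot be predicted confidently. And whenever the pre-activations are negative for all samples, the gradient reaching the earlier layers is zero, so the weights stop changing.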

It turned out the gradients were always pushed towards zero. Batch normalization and dropout solved the problem for me.
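The final architecture was not posted, so the following is only a guess at what adding batch normalization and dropout might look like. The class name `NeuralNetworkBN`, the layer placement, and the dropout rate are assumptions, and the last layer is left without an activation so that it emits raw logits.

```python
import torch
from torch import nn

class NeuralNetworkBN(nn.Module):  # hypothetical name, not from the post
    def __init__(self, p_drop: float = 0.1):
        super().__init__()
        self.stack = nn.Sequential(
            nn.Linear(12, 12),
            nn.BatchNorm1d(12),   # normalizes each hidden feature over the batch
            nn.ReLU(),
            nn.Dropout(p_drop),   # randomly zeroes activations during training
            nn.Linear(12, 12),
            nn.BatchNorm1d(12),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(12, 1),     # raw logits for BCEWithLogitsLoss
        )

    def forward(self, x):
        return self.stack(x)

model = NeuralNetworkBN()

model.train()  # batch statistics and dropout are active
train_out = model(torch.randint(0, 2, (16, 12)).float())

model.eval()   # running statistics are used and dropout is disabled
eval_out = model(torch.zeros(4, 12))
print(train_out.shape, eval_out.shape)
```

Remember to switch between `model.train()` and `model.eval()`, since both layer types behave differently in the two modes.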