Predictions Do Not Get Better Over Epochs - Loss Constant From First Epoch On

I am trying to train a NN to predict a made-up function (a binary classification problem). In particular, the input consists of 12 binary features, and the label is one whenever the second, fifth, or sixth input feature is one.
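The data is generated roughly like this (a sketch; the number of samples and the batch size are just illustrative):

import torch
from torch.utils.data import TensorDataset, DataLoader

X = torch.randint(0, 2, (10000, 12))         # 12 random binary features per sample
y = (X[:, [1, 4, 5]].sum(dim=1) > 0).long()  # 1 if the 2nd, 5th, or 6th feature is 1

batch_size = 64
# drop_last so every batch has exactly batch_size samples
dataloader = DataLoader(TensorDataset(X, y), batch_size=batch_size,
                        shuffle=True, drop_last=True)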

The NN structure looks like this:

import torch
from torch import nn

class NeuralNetwork(nn.Module):
    def __init__(self):
        super(NeuralNetwork, self).__init__()
        #self.flatten = nn.Flatten()
        self.linear_relu_stack = nn.Sequential(
            nn.Linear(12, 12),
            nn.ReLU(),
            nn.Linear(12, 12),
            nn.ReLU(),
            nn.Linear(12, 1),
            nn.ReLU(),
        )

    def forward(self, x):
        #x = self.flatten(x)
        logits = self.linear_relu_stack(x)
        return logits

model = NeuralNetwork()

while the train loop is:

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X.float())
        y.resize_((batch_size, 1))
        loss = loss_fn(pred, y.float())

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

As the loss function, BCEWithLogitsLoss() is used, and I tried both the SGD and the Adam optimizer. I suspect there might be a problem with how gradients are maintained between epochs (perhaps they are discarded, which would explain why performance stays the same every epoch), but I have not been able to figure out what causes the issue.
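One way to check this would be to print the gradient norms right after loss.backward(), e.g.:

for name, param in model.named_parameters():
    if param.grad is not None:
        print(name, param.grad.norm().item())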

The BCE loss stays at 0.693147, and I find it hard to believe the network cannot do better, since the problem is so simple.

Any help is greatly appreciated.
Thank you very much in advance!

A loss of 0.693 (≈ ln 2) is what you get when the model predicts a probability of 0.5 for every sample, so it is doing no better than a coin flip, which is consistent with it not having learned anything beyond the random initialization. I don’t see anything obvious that would prevent the model from learning more than random chance. Can you share some other parts of the code, such as where the optimizer is defined?
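For reference (just a quick illustration, not from your code): with all-zero logits the sigmoid outputs 0.5 for every sample, and BCEWithLogitsLoss returns exactly ln(2) ≈ 0.6931 regardless of the labels:

import torch
from torch import nn

logits = torch.zeros(8, 1)                    # what an "uninformed" model effectively outputs
labels = torch.randint(0, 2, (8, 1)).float()  # arbitrary binary labels
print(nn.BCEWithLogitsLoss()(logits, labels)) # tensor(0.6931)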


One issue could be the last nn.ReLU, which clips the logits to [0, +Inf) and could thus hamper the training, so you might want to remove it.
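I.e., something along these lines (only the final activation dropped, everything else as in your post):

self.linear_relu_stack = nn.Sequential(
    nn.Linear(12, 12),
    nn.ReLU(),
    nn.Linear(12, 12),
    nn.ReLU(),
    nn.Linear(12, 1),  # raw logits, as expected by BCEWithLogitsLoss
)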


It turned out the gradients were always being pushed towards zero. Adding batch normalization and dropout solved the problem for me.
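For completeness, the layer stack now looks roughly like this (the dropout probability shown here is just illustrative):

self.linear_relu_stack = nn.Sequential(
    nn.Linear(12, 12),
    nn.BatchNorm1d(12),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(12, 12),
    nn.BatchNorm1d(12),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(12, 1),  # no final ReLU; raw logits go into BCEWithLogitsLoss
)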