Constant loss on simple MLP

Hello, I have the following net to perform a binary classification of some trajectories:


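import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class DCR(nn.Module):
    def __init__(self, kemb_size, nvar, points, device):
        super().__init__()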
        self.phis = load_phis_dataset()
        self.kemb = get_kernel_embedding(self.phis, nvar, samples = kemb_size).to(device) # (concepts, kemb_size)
        _ = self.kemb.requires_grad_()
        self.fc1 = nn.Linear(kemb_size + (nvar*points), 64)
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()
    def forward(self, x):
        # concept truth degrees
        rhos = get_robustness(x, self.phis, time = False) # (trajectories, concepts)
        _ = rhos.requires_grad_()
        # embed trajectories in kernel space
        traj_emb = torch.matmul(rhos, self.kemb) # (trajectories, kemb_size)
        _ = traj_emb.requires_grad_()
        # combine info from traj_embed and x to predict class 
        x_new = x.view(x.size(0), -1) # flatten x
        combined_features = torch.cat((traj_emb, x_new), dim=1) # (trajectories, kemb_size + nvar*points)

        output = self.fc1(combined_features)
        output = F.relu(output)
        output = self.fc2(output)
        output = self.sigmoid(output)
        return output.squeeze(1)


model = DCR(kemb_size, nvar, points, device).to(device)
criterion = nn.BCELoss().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)

for epoch in range(10):
    epoch_loss = 0.0

    for batch, labels in train_loader:
        batch, labels = batch.to(device), labels.to(device)
        y_preds = model(batch)
        loss = criterion(y_preds, labels.float())
        epoch_loss += y_preds.shape[0] * loss.item()
    print(f'Epoch: {epoch}, Loss: {epoch_loss/len(y_train):.5f}')

However, the training loss remains perfectly constant at every epoch, and the weights are not updated. What could I be doing wrong?

I tried solving the problem with some .requires_grad_(), but it didn’t work.

Some explanations

  • phis = list of STL formulae
  • kemb = kernel embedding of said STL formulae
  • rhos = robustness of STL formulae on input trajectories

All shapes are as comments in the code.

Remove the sigmoid and use nn.BCEWithLogitsLoss instead for better numerical stability.

On a related note: if using BCEWithLogitsLoss for numerical stability, what would be the recommended way to still get the output of the sigmoid during inference? Just passing the output of the network through a torch.Tensor.sigmoid?

In the forward pass, you could just call:

    # return logits while training, probabilities in eval mode
    if self.training:
        return x
    return self.sigm(x)

There are several functions used in the forward pass that are undefined. It’s possible the break in logic happens in one of those. Can you share those here?

Also, have you tried seeing if the model would overfit to a small batch of randomized data and labels?


batch = torch.rand(10, 100)
labels = torch.round(torch.rand(10))

for epoch in range(10):
    epoch_loss = 0.0
    for i in range(10000):
        #insert forward/backward pass in here
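
A minimal sketch of that overfitting check with the forward/backward pass filled in. The two-layer network here is only a stand-in for the real model (an assumption, since the original preprocessing functions are not available), and it returns raw logits so it can be paired with nn.BCEWithLogitsLoss:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Small random batch: 10 samples with 100 features each, random 0/1 labels.
batch = torch.rand(10, 100)
labels = torch.round(torch.rand(10))

# Stand-in MLP (hypothetical); it outputs logits, so there is no final sigmoid.
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(), nn.Linear(64, 1))
criterion = nn.BCEWithLogitsLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

losses = []
for step in range(500):
    optimizer.zero_grad()              # clear gradients from the previous step
    logits = model(batch).squeeze(1)   # forward pass, shape (10,)
    loss = criterion(logits, labels)
    loss.backward()                    # compute gradients
    optimizer.step()                   # update the weights
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

If the loss does not drop on a batch this small, the problem is more likely in the model or the training step than in the data.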

Will this not go wrong when calculating the test/validation loss during training? It uses the same BCEWithLogitsLoss, but the network would output the sigmoid, since you would call model.eval() before evaluating the test/validation loss.

I see. I guess it depends on how you’re evaluating your model. Otherwise, you could just call torch.sigmoid(x) on your outputs when you need it.
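
One way to avoid the train/eval mismatch is to have the model always return logits and apply the sigmoid outside the model only where probabilities are needed. A rough sketch, where LogitNet and the random validation tensors are hypothetical stand-ins rather than the code from the question:

```python
import torch
import torch.nn as nn

class LogitNet(nn.Module):  # hypothetical stand-in for the model in the question
    def __init__(self, in_features):
        super().__init__()
        self.fc1 = nn.Linear(in_features, 64)
        self.fc2 = nn.Linear(64, 1)

    def forward(self, x):
        # Always return raw logits, in both train and eval mode.
        return self.fc2(torch.relu(self.fc1(x))).squeeze(1)

torch.manual_seed(0)
model = LogitNet(100)
criterion = nn.BCEWithLogitsLoss()

# Random stand-in validation data.
x_val = torch.rand(8, 100)
y_val = torch.round(torch.rand(8))

model.eval()
with torch.no_grad():
    val_logits = model(x_val)
    val_loss = criterion(val_logits, y_val)  # the loss still consumes logits
    val_probs = torch.sigmoid(val_logits)    # probabilities only where needed
    val_preds = (val_probs > 0.5).float()    # hard 0/1 predictions
```

With this layout the same criterion works for training and validation, and model.eval() only affects layers such as dropout or batch norm.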

To share those functions I would need to share the whole repository of my research, and it’s way too big (and currently private). It is possible that the logic breaks there, but I thought that calling .requires_grad_() after the computations would be enough, since the outputs of those functions are plain float tensors.
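
For reference, a small sketch of what .requires_grad_() does on the output of a function that leaves autograd. The opaque_fn below is a hypothetical stand-in for such a function, not the actual robustness code: the call turns the result into a new leaf tensor, but it does not restore the history behind it, so nothing upstream receives a gradient.

```python
import torch

w = torch.tensor([2.0], requires_grad=True)  # stands in for a learnable weight

def opaque_fn(t):
    # Hypothetical preprocessing that is not tracked by autograd.
    with torch.no_grad():
        return t ** 2

out = opaque_fn(w * 3.0)
out.requires_grad_()   # `out` becomes a new leaf; its history is not recovered
loss = out.sum()
loss.backward()

# The gradient stops at `out` and never reaches `w`.
print("out is leaf:", out.is_leaf)
print("out.grad:", out.grad)
print("w.grad:", w.grad)
```

Whether this matters for the model above depends on whether anything learnable sits upstream of those functions; the fc1 and fc2 weights come after them and should receive gradients either way.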

With a small batch of random data, without the sigmoid and with nn.BCEWithLogitsLoss, as suggested by @ptrblck, the model does train. I guess the problem really is in the ‘hidden’ code, which I didn’t write so it’ll be a mess to fix.

Thanks for the help!

Yes. It should still overfit a small set even with nn.BCELoss and a final sigmoid activation.

However, using nn.BCEWithLogitsLoss is a better habit to get into because it handles logits outside of the +10/-10 range better.
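
A rough illustration of the difference, using a deliberately large logit of 50 (an arbitrary value chosen for the demo). In float32 the sigmoid of such a logit rounds to 1, so the sigmoid + nn.BCELoss path can end up with a vanishing gradient on the logit, while nn.BCEWithLogitsLoss computes the loss directly from the logit:

```python
import torch
import torch.nn as nn

target = torch.tensor([0.0])  # true class is 0, but the logit says "1" confidently

# Path 1: sigmoid followed by BCELoss.
logit_a = torch.tensor([50.0], requires_grad=True)
loss_a = nn.BCELoss()(torch.sigmoid(logit_a), target)
loss_a.backward()

# Path 2: BCEWithLogitsLoss on the raw logit.
logit_b = torch.tensor([50.0], requires_grad=True)
loss_b = nn.BCEWithLogitsLoss()(logit_b, target)
loss_b.backward()

print("sigmoid + BCELoss:  loss =", loss_a.item(), " grad =", logit_a.grad.item())
print("BCEWithLogitsLoss:  loss =", loss_b.item(), " grad =", logit_b.grad.item())
```

A saturated sigmoid like this is one possible cause of a loss that stays constant, which is worth ruling out if the inputs to the network are large or unscaled.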

With the other preprocessing functions, you could try plotting out what those are giving for outputs to see if they are doing what you expect.
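
Besides plotting, a quick numeric check can help. The sketch below uses two hypothetical helpers, describe and grad_norms, plus random stand-in features; the idea is to print summary statistics of whatever the preprocessing returns and then confirm that every parameter receives a gradient after one backward pass:

```python
import torch
import torch.nn as nn

def describe(name, t):
    # Summary statistics for an intermediate tensor (hypothetical helper).
    t = t.detach().float()
    return {
        "name": name,
        "shape": tuple(t.shape),
        "min": t.min().item(),
        "max": t.max().item(),
        "mean": t.mean().item(),
        "has_nan": bool(torch.isnan(t).any()),
    }

def grad_norms(model):
    # Gradient norm per parameter; None means no gradient reached it.
    return {
        n: (None if p.grad is None else p.grad.norm().item())
        for n, p in model.named_parameters()
    }

torch.manual_seed(0)
features = torch.randn(16, 20) * 1000.0   # stand-in for unscaled features
labels = torch.round(torch.rand(16))
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 1))

loss = nn.BCEWithLogitsLoss()(model(features).squeeze(1), labels)
loss.backward()

print(describe("features", features))
print(grad_norms(model))
```

Very large magnitudes, NaNs, or gradient norms that are None or zero would each point to a different place to look.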