Constant loss on simple MLP

Hello, I have the following net to perform a binary classification of some trajectories:

Architecture

import torch
import torch.nn as nn
import torch.nn.functional as F

class DCR(nn.Module):
    def __init__(self, kemb_size, nvar, points, device):
        super().__init__()
        self.phis = load_phis_dataset()
        self.kemb = get_kernel_embedding(self.phis, nvar, samples = kemb_size).to(device) # (concepts, kemb_size)
        _ = self.kemb.requires_grad_()
        
        self.fc1 = nn.Linear(kemb_size + (nvar*points), 64)
        self.fc2 = nn.Linear(64, 1)
        self.sigmoid = nn.Sigmoid()
                
    def forward(self, x):
        # concept truth degrees
        rhos = get_robustness(x, self.phis, time = False) # (trajectories, concepts)
        _ = rhos.requires_grad_()
        # embed trajectories in kernel space
        traj_emb = torch.matmul(rhos, self.kemb) # (trajectories, kemb_size)
        _ = traj_emb.requires_grad_()
        # combine info from traj_embed and x to predict class 
        x_new = x.view(x.size(0), -1) # flatten x
        combined_features = torch.cat((traj_emb, x_new), dim=1) # (trajectories, kemb_size + nvar*points)

        output = self.fc1(combined_features)
        output = F.relu(output)
        output = self.fc2(output)
        output = self.sigmoid(output)
        
        return output.squeeze(1)

Training

import torch.optim as optim

model = DCR(kemb_size, nvar, points, device).to(device)
criterion = nn.BCELoss().to(device)
optimizer = optim.SGD(model.parameters(), lr=0.01)

model.train()
for epoch in range(10):
    epoch_loss = 0.0

    for batch, labels in train_loader:
        batch, labels = batch.to(device), labels.to(device)
                 
        y_preds = model(batch)
        loss = criterion(y_preds, labels.float())
        
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        
        epoch_loss += y_preds.shape[0] * loss.item()
  
    print(f'Epoch: {epoch}, Loss: {epoch_loss/len(train_loader.dataset):.5f}')

However, the training loss remains perfectly constant at every epoch, and the weights are not updated. What could I be doing wrong?

I tried solving the problem with some .requires_grad_(), but it didn’t work.

Some explanations

  • phis = list of STL formulae
  • kemb = kernel embedding of said STL formulae
  • rhos = robustness of STL formulae on input trajectories

All shapes are given as comments in the code.

Remove the sigmoid and use nn.BCEWithLogitsLoss instead for better numerical stability.
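
For reference, a minimal self-contained sketch of what the swap amounts to (the shapes here are arbitrary, not those of DCR): the model returns raw logits and the loss applies the sigmoid internally.

import torch
import torch.nn as nn

logits = torch.randn(4)                      # raw network outputs, no sigmoid applied
labels = torch.randint(0, 2, (4,)).float()   # binary targets

# nn.BCEWithLogitsLoss fuses the sigmoid and the binary cross-entropy
# into a single, numerically stable operation
stable_loss = nn.BCEWithLogitsLoss()(logits, labels)

# mathematically the same as applying the sigmoid first, but less stable
manual_loss = nn.BCELoss()(torch.sigmoid(logits), labels)

print(stable_loss.item(), manual_loss.item())  # the two agree for moderate logits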

On a related note: if using BCEWithLogitsLoss for numerical stability, what would be the recommended way to still get the sigmoid output during inference? Just pass the output of the network through torch.sigmoid?

In the forward pass, you could just call:

if self.training:
    return x                # raw logits, consumed by nn.BCEWithLogitsLoss
else:
    return self.sigmoid(x)  # probabilities during inference

@ireneferfo
There are several functions used in the forward pass that are undefined. It’s possible the break in logic happens in one of those. Can you share those here?

Also, have you tried seeing if the model would overfit to a small batch of randomized data and labels?

I.e.

batch = torch.rand(10, 100)           # small random batch
labels = torch.round(torch.rand(10))  # random binary labels

for epoch in range(10):
    epoch_loss = 0.0
    for i in range(10000):
        # insert the forward/backward pass here
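
Spelled out with the training objects defined above (the (10, 100) shape is just a placeholder and would have to match whatever DCR actually expects as input), the sanity check could look like:

# tiny random dataset: if the loss does not drop towards zero here,
# the model/criterion/optimizer wiring is broken somewhere
batch = torch.rand(10, 100).to(device)
labels = torch.round(torch.rand(10)).to(device)

model.train()
for i in range(10000):
    y_preds = model(batch)
    loss = criterion(y_preds, labels)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if i % 1000 == 0:
        print(f'iter {i}, loss {loss.item():.5f}')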

Won't this go wrong when computing the test/validation loss during training? The criterion is still BCEWithLogitsLoss, but the network would then output sigmoid probabilities, since you would call model.eval() (setting self.training = False) before evaluating on the test/validation set.

I see. I guess it depends on how you’re evaluating your model. Otherwise, you could just call torch.sigmoid(x) on your outputs when you need the probabilities.
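
For example, something along these lines for the evaluation loop (val_loader is a placeholder name, and this assumes forward() always returns raw logits and criterion has been swapped to nn.BCEWithLogitsLoss as suggested above):

model.eval()
val_loss = 0.0
with torch.no_grad():
    for batch, labels in val_loader:
        batch, labels = batch.to(device), labels.to(device)

        logits = model(batch)                                  # still raw logits in eval mode
        val_loss += criterion(logits, labels.float()).item()   # same BCEWithLogitsLoss as in training

        probs = torch.sigmoid(logits)       # probabilities, only where they are needed
        preds = (probs > 0.5).float()       # hard class predictions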


To share those functions I would need to share the whole repository of my research, and it’s way too big (and currently private). It is possible that the logic breaks there, but I thought that calling .requires_grad_() after those computations would be enough, since the outputs of those functions are just plain tensors of floats.

With a small batch of random data, without the sigmoid and with nn.BCEWithLogitsLoss, as suggested by @ptrblck, the model does train. I guess the problem really is in the ‘hidden’ code, which I didn’t write so it’ll be a mess to fix.

Thanks for the help!

Yes. It should still overfit a small set even with nn.BCELoss and a final sigmoid activation.

However, nn.BCEWithLogitsLoss is a better habit to get into because it handles logits outside the +10/-10 range better.
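
Writing the binary cross-entropy out by hand makes the issue visible: in float32 the sigmoid of a large logit rounds to exactly 1.0, so the log blows up, while the log-sum-exp formulation used by nn.BCEWithLogitsLoss stays finite.

import torch
import torch.nn as nn

x = torch.tensor([30.0])   # a large positive logit
y = torch.tensor([0.0])    # target class 0

p = torch.sigmoid(x)       # rounds to exactly 1.0 in float32
naive = -(y * torch.log(p) + (1 - y) * torch.log(1 - p))   # log(0) -> inf
stable = nn.BCEWithLogitsLoss()(x, y)                      # ~30.0, finite

print(naive.item(), stable.item())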

As for the other preprocessing functions, you could try plotting their outputs to check that they are doing what you expect.
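
For example (x here stands for one batch of trajectories from train_loader, and matplotlib is assumed to be available):

import matplotlib.pyplot as plt

rhos = get_robustness(x, model.phis, time=False)
print('rhos shape:', rhos.shape,
      'min:', rhos.min().item(), 'max:', rhos.max().item(), 'std:', rhos.std().item())
print('kemb shape:', model.kemb.shape,
      'min:', model.kemb.min().item(), 'max:', model.kemb.max().item())

# a (nearly) constant rhos across trajectories would be one explanation for a flat loss
plt.hist(rhos.detach().cpu().flatten().numpy(), bins=50)
plt.title('distribution of robustness values')
plt.show()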