Model does not train: Same loss in every epoch

Hey everyone,

This is my second PyTorch implementation so far, and the same thing happened with my first one: the model does not learn anything and outputs the same loss and accuracy for every epoch, and even for each batch within an epoch. My guess is that something is wrong with the way I feed the data to the model.

I try to follow the basic tutorial for implementing a pytorch model: Optimizing Model Parameters — PyTorch Tutorials 1.8.1+cu102 documentation

I read similar topics on the PyTorch forum, e.g. Same values in every epoch when training

I’m using nn.BCEWithLogitsLoss() and already tried to overfit the model on a training sample of size 600.
It’s a rather simple model, so I’m a bit confused that it’s not working at all.
Find the code below:

Load the data

df = pd.read_csv(Path()/ "cleaned_data.csv", nrows=1000)
#delete id column
del df["Unnamed: 0"]
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

X_train = train.iloc[:,1:]
Y_train = train.iloc[:,1]

X_validate = validate.iloc[:,1:]
Y_validate = validate.iloc[:,1]

X_test = validate.iloc[:,1:]
Y_test = validate.iloc[:,1]

x_train_pt = torch.from_numpy(X_train.values)
y_train_pt = torch.from_numpy(Y_train.values.reshape((-1,1)))

x_val_pt = torch.from_numpy(X_validate.values)
y_val_pt = torch.from_numpy(Y_validate.values.reshape((-1,1)))

x_test_pt = torch.from_numpy(X_test.values)
y_test_pt = torch.from_numpy(Y_test.values.reshape((-1,1)))
print(x_train_pt.shape) #torch.Size([600, 22])
print(y_train_pt.shape) #torch.Size([600, 1])

batch_size=100
train_dataset = TensorDataset(x_train_pt,y_train_pt) # create your dataset
trainloader = DataLoader(train_dataset, batch_size=batch_size) # create your dataloader

val_dataset = TensorDataset(x_val_pt,y_val_pt)
valloader = DataLoader(val_dataset, batch_size=batch_size)

test_dataset = TensorDataset(x_test_pt,y_test_pt)
testloader = DataLoader(test_dataset, batch_size=batch_size)

The Model

class LendingClub(nn.Module):
    def __init__(self):
        super(LendingClub, self).__init__()
        self.linear = nn.Sequential(
            nn.Linear(22, 100),
            nn.ReLU(),
            nn.Linear(100, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.linear(x)
    
model = LendingClub()

def weights_init(m):
  if isinstance(m, nn.Linear):
      nn.init.uniform_(m.weight.data, -1,1)
      nn.init.zeros_(m.bias.data)

model.apply(weights_init)

Training

loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)

def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X.float())
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        #if batch % 100 == 0:
        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    test_loss, correct = 0, 0

    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X.float())
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()

    test_loss /= size
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")
model.train()
start = time.time()
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(trainloader, model, loss_fn, optimizer)
    test_loop(valloader, model, loss_fn)
end = time.time()
print(f"total training time in minutes: {(end-start)/60}")

Output

Epoch 1
-------------------------------
loss: 0.693147  [    0/  600]
loss: 0.693147  [  100/  600]
loss: 0.693147  [  200/  600]
loss: 0.693147  [  300/  600]
loss: 0.693147  [  400/  600]
loss: 0.693147  [  500/  600]
Epoch 2
-------------------------------
loss: 0.693147  [    0/  600]
loss: 0.693147  [  100/  600]
loss: 0.693147  [  200/  600]
loss: 0.693147  [  300/  600]
loss: 0.693147  [  400/  600]
loss: 0.693147  [  500/  600]
Epoch 3
-------------------------------
loss: 0.693147  [    0/  600]
loss: 0.693147  [  100/  600]
loss: 0.693147  [  200/  600]
loss: 0.693147  [  300/  600]
loss: 0.693147  [  400/  600]
loss: 0.693147  [  500/  600]
Epoch 4
-------------------------------
loss: 0.693147  [    0/  600]
loss: 0.693147  [  100/  600]
loss: 0.693147  [  200/  600]
loss: 0.693147  [  300/  600]
loss: 0.693147  [  400/  600]
loss: 0.693147  [  500/  600]
Epoch 5
-------------------------------
loss: 0.693147  [    0/  600]
loss: 0.693147  [  100/  600]
loss: 0.693147  [  200/  600]
loss: 0.693147  [  300/  600]
loss: 0.693147  [  400/  600]
loss: 0.693147  [  500/  600]
total training time in minutes: 0.011684314409891764

The model does not always output exactly the same loss across runs, but within a run the losses repeat identically in every epoch.
The output from another run looked as follows:

Epoch 1
-------------------------------
loss: 1.026536  [    0/  600]
loss: 0.980619  [  100/  600]
loss: 0.964926  [  200/  600]
loss: 1.120505  [  300/  600]
loss: 1.092735  [  400/  600]
loss: 0.910104  [  500/  600]
Test Error: 
 Accuracy: 0.0%, Avg loss: 0.010968 

Epoch 2
-------------------------------
loss: 1.026536  [    0/  600]
loss: 0.980619  [  100/  600]
loss: 0.964926  [  200/  600]
loss: 1.120505  [  300/  600]
loss: 1.092735  [  400/  600]
loss: 0.910104  [  500/  600]
Test Error: 
 Accuracy: 0.0%, Avg loss: 0.010968 

Epoch 3
-------------------------------
loss: 1.026536  [    0/  600]
loss: 0.980619  [  100/  600]
loss: 0.964926  [  200/  600]
loss: 1.120505  [  300/  600]
loss: 1.092735  [  400/  600]
loss: 0.910104  [  500/  600]
Test Error: 
 Accuracy: 0.0%, Avg loss: 0.010968

Thanks for your replies, any help is very much appreciated.

nn.BCEWithLogitsLoss expects logits as the model output, not probabilities, so you would have to remove the last sigmoid activation in your model. Besides that I don’t see any obvious issues in your code.
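
The constant 0.693147 in the first run can be reproduced with a minimal sketch. It assumes (this is a guess, not something confirmed in the thread) that the final nn.Sigmoid() has collapsed to 0 for every sample. nn.BCEWithLogitsLoss then treats that 0 as a logit, applies its own sigmoid to get 0.5, and returns -log(0.5) = log(2) whatever the label is.

```python
import math

import torch
import torch.nn as nn

loss_fn = nn.BCEWithLogitsLoss()

# Hypothetical batch: the model's final Sigmoid emits 0 for every sample.
saturated_output = torch.zeros(6, 1)
targets = torch.tensor([[1.0], [0.0], [1.0], [1.0], [0.0], [0.0]])

# The loss interprets the 0s as logits: sigmoid(0) = 0.5 for every sample.
loss = loss_fn(saturated_output, targets).item()

print(f"{loss:.6f}")         # 0.693147
print(f"{math.log(2):.6f}")  # 0.693147
```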

Thanks for the fast reply.

If I deactivate the Sigmoid, the model indeed starts learning, though not really well. It also starts outputting numbers outside of the interval [0,1].
But my true labels are 0 or 1 in a Tensor of shape [batch_size, 1], so coming from Keras I chose Sigmoid because I want the output to be a probability between 0 and 1.

So which loss function would I have to choose for the model to work? Or should I remodel the true labels to [batch_size, 2], i.e. using one hot encodings, and then use simple BinaryCrossEntropy loss?

Logits are unbounded and can contain any values in [-Inf, Inf]. nn.BCEWithLogitsLoss will apply the sigmoid internally in a numerically stable way, so the model outputs are expected to contain positive and negative values.
You could use the (numerically less stable) nn.BCELoss with a sigmoid activation if you prefer it.
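
A small sketch of what "numerically less stable" means here, using a made-up logit of 30 on a sample whose label is 0. In float32 the sigmoid of 30 is exactly 1.0, so nn.BCELoss ends up taking log(0), which it clamps to -100, while nn.BCEWithLogitsLoss works on the logit directly and returns a value close to 30.

```python
import torch
import torch.nn as nn

logit = torch.tensor([[30.0]])  # made-up, very confident "1"
target = torch.tensor([[0.0]])  # but the true label is 0

# Loss computed from the logit itself.
stable = nn.BCEWithLogitsLoss()(logit, target).item()

# Loss computed from the probability, which has already lost precision.
prob = torch.sigmoid(logit)
naive = nn.BCELoss()(prob, target).item()

print(prob.item())  # 1.0 in float32
print(stable)       # close to 30
print(naive)        # hits the clamp applied by nn.BCELoss
```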

Adding to what @ptrblck says: keep a shape of [batch_size, 1]. To get the prediction of your model once you have your logits (also of shape [batch_size, 1]), you can do:

import torch
torch.manual_seed(0)
batch_size = 10
logits = torch.empty(batch_size, dtype=torch.float).uniform_(-10, 10)

pred = (logits >= 0.).int()
# Or, if you prefer to use the sigmoid (x >= 0 <=> 1 / (1 + e^(-x)) >= 0.5)
pred = torch.sigmoid(logits).round().int() # = (torch.sigmoid(logits) >= 0.5).int()

If you switch to a shape of [batch_size, 2] and use nn.BCEWithLogitsLoss, it will be like doing multi-label classification.

I tried both of your proposals, but with no success. Although the loss changes, the pattern still remains the same across all epochs. It doesn’t matter whether I drop nn.Sigmoid() as the last activation function and use nn.BCEWithLogitsLoss, or use no sigmoid and nn.BCELoss. I did not really understand how to incorporate what @pascal_notsawo suggested, though. I tried:

pred = model(X)
pred = torch.sigmoid(pred).round()
loss = loss_fn(pred, y)

This is the training performance for Sigmoid activation with BCELoss. The accuracy of 1500% also indicates that something is really wrong:

Epoch 1
-------------------------------
loss: 11.000000  [    0/  600]
loss: 21.000000  [  100/  600]
loss: 25.000000  [  200/  600]
loss: 22.000000  [  300/  600]
loss: 19.000000  [  400/  600]
loss: 14.000000  [  500/  600]
Test Error: 
 Accuracy: 1500.0%, Avg loss: 0.150000 
...
Epoch 5
-------------------------------
loss: 11.000000  [    0/  600]
loss: 21.000000  [  100/  600]
loss: 25.000000  [  200/  600]
loss: 22.000000  [  300/  600]
loss: 19.000000  [  400/  600]
loss: 14.000000  [  500/  600]
Test Error: 
 Accuracy: 1500.0%, Avg loss: 0.150000
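
One plausible source of the impossible accuracy is the `pred.argmax(1) == y` check in test_loop, which is written for multi-class outputs of shape [N, num_classes]. The sketch below uses made-up tensors to show what it does with a single output column: argmax is always 0, and comparing a [N] tensor with [N, 1] labels broadcasts to an [N, N] matrix. The thresholded version underneath is one possible binary replacement, not necessarily what the final code used.

```python
import torch

# Made-up validation batch: 4 samples, one logit each, labels of shape [4, 1].
logits = torch.tensor([[2.0], [-1.0], [0.5], [-3.0]])
y = torch.tensor([[1.0], [0.0], [0.0], [0.0]])

# Multi-class style check: argmax over a single column is always 0 (shape [4]),
# and comparing it with y (shape [4, 1]) broadcasts to a [4, 4] matrix.
buggy_matches = logits.argmax(1) == y
buggy_correct = buggy_matches.type(torch.float).sum().item()

# Binary check: threshold the logit at 0, keeping the [4, 1] shape.
pred = (logits >= 0).float()
correct = (pred == y).type(torch.float).sum().item()

print(buggy_matches.shape)  # torch.Size([4, 4])
print(buggy_correct)        # 12.0 = 4 * (number of zero labels)
print(correct)              # 3.0
```

With 200 validation samples in two batches of 100, the broadcasted count is 100 times the number of zero labels, so 1500% would be consistent with about 30 zero labels in the validation set.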

When I monitor the output of the model during training, I see that it correctly outputs a Tensor of shape [batch_size, 1], but containing only 0 or 1. I don’t really understand this output: since I use the sigmoid activation, I would expect numbers between 0 and 1, not exclusively 0 or 1.
I further observed that the output generated by the model for the last batch in epoch 1 is identical to the output generated for the same batch in the last epoch (I collected the outputs and compared the tensors after training). This seems to indicate that the model does not update its parameters.

So my actual question at the moment is: what am I doing wrong that the model does not output values between 0 and 1, but only 0 or 1? If it did, using nn.BCELoss should be the right way to go. The same model architecture with BCELoss works in Keras, by the way.

I’d be open to suggestions on how to change the architecture of the model to make it work. My data looks as follows, and the task is to predict the label loan_status (0 or 1). (The data is not completely cleaned and preprocessed yet.)

df.head()

   loan_status   int_rate  installment  annual_inc  loan_amnt        dti   open_acc   pub_rec  revol_bal  revol_util  seniority  term  emp_length  home_ownership  verification_status  purpose  initial_list_status  application_type  address  issue_d
0            1  -0.491799    -0.408291    117000.0  -0.492243   1.194178   0.913445  -0.34997   1.340611   -0.491258   0.357143    -1           0               2                   -1       10                    1                 1    22690     2015
1            1  -0.368816    -0.662750     65000.0  -0.731551   0.646367   1.108239  -0.34997  -0.174517   -0.019706   0.157143    -1           1               0                   -1        2                   -1                 1     5113     2015
2            1  -0.704225     0.299609     43057.0   0.177819  -0.564308   0.329061  -0.34997  -0.813871    1.575371   0.114286    -1           1               2                    1        1                   -1                 1     5113     2015
3            1  -1.598649    -0.842348     54000.0  -0.827274  -1.896573  -1.034500  -0.34997  -0.495024   -1.323650   0.114286    -1           1               2                   -1        1                   -1                 1      813     2014
4            0   0.811824     0.707861     55000.0   1.227783   1.685770   0.329061  -0.34997  -0.465887    0.656869   0.200000     1           1               0                    1        1                   -1                 1    11650     2013
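
In the rows above, annual_inc and issue_d are several orders of magnitude larger than the already standardized columns. Combined with weights drawn from uniform(-1, 1), that might push the activations to very large values, though the thread does not confirm this. A minimal sketch of standardizing such columns with statistics taken from the training split only; the small DataFrame is made up from the values shown, and whether an identifier-like column such as address should be scaled at all is a separate question.

```python
import pandas as pd

# Made-up training rows copying two unscaled columns and one scaled column.
train = pd.DataFrame({
    "annual_inc": [117000.0, 65000.0, 43057.0, 54000.0, 55000.0],
    "issue_d": [2015.0, 2015.0, 2015.0, 2014.0, 2013.0],
    "int_rate": [-0.491799, -0.368816, -0.704225, -1.598649, 0.811824],
})

# Compute the statistics on the training split and reuse them for val/test.
mean = train.mean()
std = train.std().replace(0, 1.0)  # guard against constant columns

train_scaled = (train - mean) / std
print(train_scaled.round(3))
```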

@melste The code I gave was meant for computing metrics such as accuracy or precision. For the loss itself, you have two options:

  1. Either you remove the Sigmoid layer at the output of the model (advised, because it is more stable):
loss_fn = nn.BCEWithLogitsLoss()

logits = model(...your inputs...)

loss = loss_fn(logits, y)

Then you can apply the sigmoid to the logits output as I mentioned in my post above, to know if your model has predicted 0 or 1.

prediction = (logits >= 0.).int()
# Or, if you prefer to use the sigmoid (x >= 0 <=> 1 / (1 + e^(-x)) >= 0.5)
probability  = torch.sigmoid(logits)
prediction = probability.round().int() # = (probability >= 0.5).int()
  2. Or you keep this sigmoid layer:
loss_fn = nn.BCELoss()

probability = model(...your inputs...)

loss = loss_fn(probability, y)

In this case your model directly returns a probability (in [0,1]), which you can also compare to 0.5 to know whether your model has predicted 0 or 1.

prediction = probability.round().int() # = (probability >= 0.5).int()

I think I understood your answer now and why using nn.BCEWithLogitsLoss() together with Sigmoid is a bad idea. Thanks a lot!

So I’d like to go with the second proposal. But what I don’t understand is why the model outputs 0 or 1 and not values between 0 and 1.
Output of the model (i.e. pred) looks like this:

tensor([[1.],
        [0.],
        [0.],
        [0.],
        ...
        [0.],
        [1.],
        [0.],
        [0.],
        [1.]], grad_fn=<SigmoidBackward>)

I would like to have an output as in Keras:

[4.5889354e-01],
       [1.0000000e+00],
       [9.9999785e-01],
       ...
       [8.9359963e-01],
       [1.0000000e+00],
       [2.6206519e-05],
       [1.0000000e+00],
       [1.0000000e+00]], dtype=float32)>
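
One possible explanation for a Sigmoid output made only of exact 0s and 1s is saturation. The sketch below uses hypothetical pre-sigmoid values of very large magnitude (the thread never shows the real ones). In float32 the sigmoid of such values is exactly 0 or 1, and its gradient p * (1 - p) is exactly 0, so the optimizer has nothing to update and every epoch reproduces the same predictions.

```python
import torch

# Hypothetical pre-sigmoid activations of very large magnitude.
logits = torch.tensor([[-250.0], [-120.0], [95.0], [400.0]], requires_grad=True)

prob = torch.sigmoid(logits)
prob.sum().backward()

print(prob.detach().flatten())  # only exact 0s and 1s
print(logits.grad.flatten())    # all zeros: no gradient flows back
```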

It seems you are using .round() in the wrong place.
As mentioned in each of my comments, knowing that the probability (after the sigmoid) is in the range [0,1], probability.round() is just an elegant way of doing probability >= 0.5: it returns 0. (False) if the value is in the range [0, 0.5) and 1. (True) if the value is in the interval (0.5, 1]. (The two differ only at exactly 0.5, where torch.round rounds half to even and gives 0.)
In both cases, I call .int() to convert them to integers (0. becomes 0 and 1. becomes 1).
So don’t use it indiscriminately: apply it only to get the final predictions, never before computing the loss.
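
A minimal sketch of why rounding before the loss stops training, using made-up logits. torch.round has a zero gradient everywhere, so nothing reaches the logits, and nn.BCELoss on hard 0/1 values clamps each wrong sample at 100.

```python
import torch
import torch.nn as nn

# Made-up logits and labels; only the third sample is misclassified.
logits = torch.tensor([[0.3], [-1.2], [2.0]], requires_grad=True)
target = torch.tensor([[1.0], [0.0], [0.0]])

# Rounding before the loss: the loss only ever sees hard 0/1 values.
hard = torch.sigmoid(logits).round()
loss = nn.BCELoss()(hard, target)
loss.backward()

print(loss.item())   # about 33.33: one wrong sample out of three, clamped at 100
print(logits.grad)   # all zeros, because round() blocks the gradient
```

This would also be consistent with the whole-number losses reported earlier in the thread: with a batch of 100, each wrong hard prediction contributes 100 / 100 = 1 to the mean, so a loss of 11.0 corresponds to 11 misclassified samples.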

It is important to distinguish between logits, probabilities and the final prediction of the model.

BCE with logits expects the logits, plain BCE expects the probabilities, and acc_score … expects the predictions.

I don’t use .round() anywhere. I’m using what you suggested for nn.BCELoss().

So the training loop is simple (the output I reported in my previous post was pred in this loop):

for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        #if batch % 100 == 0:
        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f}  [{current:>5d}/{size:>5d}]")

The model is the same as described in my first post.

class LendingClub(nn.Module):
    def __init__(self):
        super(LendingClub, self).__init__()
        self.linear = nn.Sequential(
            nn.Linear(22, 100),
            nn.ReLU(),
            nn.Linear(100, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.linear(x)
    
model = LendingClub()

Which variable in your pipeline does this output correspond to?

It’s pred in the training loop. So far I have seen tensors containing only 0s, only 1s, or a mix of both, but never a value in between…
I collected it this way:

predictions = []
...
for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X)
        predictions.append(pred)
        loss = loss_fn(pred, y)
        ....

Does anyone have another suggestion?
I also tried to run the same model on Google Colab to check whether something on my device is wrong, but the performance is the same there.

So far I could narrow the problem down to the fact that the predicted output of the model for a given batch is identical in every epoch.
The predictions of the model are always a tensor with exclusively zeros and/or ones.
The same model (identical architecture) with exactly the same data performs ok when I use Keras. So the data does not seem to be the problem.
If I deactivate the Sigmoid activation of the last layer and use BCEWithLogitsLoss, the model does not get stuck.

Can you share the link to your colab notebook?

Hey, sure. Here comes the link: Google Colaboratory

Though I don’t really know how to make the data accessible for you…

Could you find an error in the pytorch implementation?

Your model is generally able to overfit a small dataset as seen here:

class LendingClub(nn.Module):
    def __init__(self):
        super(LendingClub, self).__init__()
        self.linear = nn.Sequential(
            nn.Linear(22, 100),
            nn.ReLU(),
            nn.Linear(100, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
        )

    def forward(self, x):
        return self.linear(x)
    
model = LendingClub()

def weights_init(m):
  if isinstance(m, nn.Linear):
      nn.init.uniform_(m.weight.data, -1,1)
      nn.init.zeros_(m.bias.data)
model.apply(weights_init)

loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

data = torch.randn(8, 22)
target = torch.randint(0, 2, (8, 1)).float()

for epoch in range(100):
    pred = model(data)
    loss = loss_fn(pred, target)
        
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    print('epoch {}, loss {}, acc {}'.format(
        epoch, loss.item(), ((pred>0.0)==target).float().mean()))

This works if you use nn.BCEWithLogitsLoss (your current code snippet seems to mix different approaches, such as nn.BCELoss + sigmoid(output)), so you could also try to overfit a small subset of the real data by playing around with some hyperparameters.
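
A rough sketch of that overfitting check on a slice of an existing dataset, assuming the data already sits in a TensorDataset as in the first post. The random tensors here only stand in for the real features and labels, and the smaller network and learning rate are arbitrary choices for illustration.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Subset, TensorDataset

torch.manual_seed(0)

# Stand-in for the real train_dataset (random values, 22 features, 0/1 labels).
full_dataset = TensorDataset(
    torch.randn(600, 22), torch.randint(0, 2, (600, 1)).float()
)

# Keep only the first 8 samples and feed them as a single batch.
tiny_loader = DataLoader(Subset(full_dataset, range(8)), batch_size=8)

# Small model that returns raw logits (no Sigmoid at the end).
model = nn.Sequential(nn.Linear(22, 32), nn.ReLU(), nn.Linear(32, 1))
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-2)

losses = []
for epoch in range(200):
    for X, y in tiny_loader:
        loss = loss_fn(model(X), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    losses.append(loss.item())

print(f"first loss {losses[0]:.4f}, last loss {losses[-1]:.4f}")
```

If the loss on these few samples does not drop towards zero with the real data, the problem is more likely in the data pipeline or the initialization than in the model capacity.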