Hey everyone,
This is my second PyTorch implementation so far, and the same thing happened with my first one: the model does not learn anything and outputs the same loss and accuracy for every epoch, and even for every batch within an epoch. My personal guess is that the way I feed the data to the model is not implemented correctly.
I tried to follow the basic tutorial for implementing a PyTorch model: Optimizing Model Parameters — PyTorch Tutorials 1.8.1+cu102 documentation
I have also read similar topics on the PyTorch forum, e.g. Same values in every epoch when training.
I'm using nn.BCEWithLogitsLoss() and have already tried to overfit the model on a training sample of size 600. It's a rather simple model, so I'm a bit confused that it's not working at all.
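Side note on the loss, since it may matter: as far as I understand, nn.BCEWithLogitsLoss applies the sigmoid internally and expects raw logits, which this small check confirms:

```python
import torch
from torch import nn

logits = torch.randn(4, 1)                          # raw, unbounded model outputs
target = torch.empty(4, 1).uniform_(0, 1).round()   # random 0/1 labels

# BCEWithLogitsLoss == sigmoid followed by BCELoss
a = nn.BCEWithLogitsLoss()(logits, target)
b = nn.BCELoss()(torch.sigmoid(logits), target)
print(torch.allclose(a, b))  # True
```

Since my model ends in nn.Sigmoid() (see below), I'm not sure whether feeding already-squashed probabilities into this loss is part of the problem.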
Find the code below:
Load the data
```python
import time
from pathlib import Path

import numpy as np
import pandas as pd
import torch
from torch import nn, optim
from torch.utils.data import DataLoader, TensorDataset

df = pd.read_csv(Path() / "cleaned_data.csv", nrows=1000)

# delete the id column
del df["Unnamed: 0"]

# 60/20/20 train/validation/test split
train, validate, test = np.split(df.sample(frac=1), [int(.6 * len(df)), int(.8 * len(df))])

X_train = train.iloc[:, 1:]
Y_train = train.iloc[:, 1]
X_validate = validate.iloc[:, 1:]
Y_validate = validate.iloc[:, 1]
X_test = test.iloc[:, 1:]
Y_test = test.iloc[:, 1]

x_train_pt = torch.from_numpy(X_train.values)
y_train_pt = torch.from_numpy(Y_train.values.reshape((-1, 1)))
x_val_pt = torch.from_numpy(X_validate.values)
y_val_pt = torch.from_numpy(Y_validate.values.reshape((-1, 1)))
x_test_pt = torch.from_numpy(X_test.values)
y_test_pt = torch.from_numpy(Y_test.values.reshape((-1, 1)))

print(x_train_pt.shape)  # torch.Size([600, 22])
print(y_train_pt.shape)  # torch.Size([600, 1])

batch_size = 100
train_dataset = TensorDataset(x_train_pt, y_train_pt)           # create the dataset
trainloader = DataLoader(train_dataset, batch_size=batch_size)  # create the dataloader
val_dataset = TensorDataset(x_val_pt, y_val_pt)
valloader = DataLoader(val_dataset, batch_size=batch_size)
test_dataset = TensorDataset(x_test_pt, y_test_pt)
testloader = DataLoader(test_dataset, batch_size=batch_size)
```
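To test my guess that the data feeding is the problem, I peeked at one batch from the loader (quick sanity check; the dtype comments are what I expect from the conversions above):

```python
# inspect the first batch: shapes, dtypes and label balance
X, y = next(iter(trainloader))
print(X.shape, X.dtype)              # expecting torch.Size([100, 22]) and float64
print(y.shape, y.dtype)              # expecting torch.Size([100, 1])
print(y.unique(return_counts=True))  # are both classes actually present?
```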
The Model
```python
class LendingClub(nn.Module):
    def __init__(self):
        super(LendingClub, self).__init__()
        self.linear = nn.Sequential(
            nn.Linear(22, 100),
            nn.ReLU(),
            nn.Linear(100, 64),
            nn.ReLU(),
            nn.Linear(64, 32),
            nn.ReLU(),
            nn.Linear(32, 16),
            nn.ReLU(),
            nn.Linear(16, 8),
            nn.ReLU(),
            nn.Linear(8, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return self.linear(x)


model = LendingClub()


def weights_init(m):
    if isinstance(m, nn.Linear):
        nn.init.uniform_(m.weight.data, -1, 1)
        nn.init.zeros_(m.bias.data)


model.apply(weights_init)
```
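Because nn.init.uniform_(..., -1, 1) is much wider than the default nn.Linear initialization, I also checked what the freshly initialized model outputs (sketch, reusing x_train_pt from above):

```python
# does the untrained model already collapse to a constant output?
with torch.no_grad():
    out = model(x_train_pt.float())
print(out.min().item(), out.max().item(), out.std().item())
# min == max (e.g. 0.5 everywhere after the sigmoid) would mean every sample
# gets the same prediction before training even starts
```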
Training
```python
learning_rate = 1e-3  # placeholder; the actual value was set elsewhere
epochs = 5            # matches the run shown under "Output" below

loss_fn = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=learning_rate)


def train_loop(dataloader, model, loss_fn, optimizer):
    size = len(dataloader.dataset)
    for batch, (X, y) in enumerate(dataloader):
        # Compute prediction and loss
        pred = model(X.float())
        loss = loss_fn(pred, y)

        # Backpropagation
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # if batch % 100 == 0:
        loss, current = loss.item(), batch * len(X)
        print(f"loss: {loss:>7f} [{current:>5d}/{size:>5d}]")


def test_loop(dataloader, model, loss_fn):
    size = len(dataloader.dataset)
    test_loss, correct = 0, 0
    with torch.no_grad():
        for X, y in dataloader:
            pred = model(X.float())
            test_loss += loss_fn(pred, y).item()
            correct += (pred.argmax(1) == y).type(torch.float).sum().item()
    test_loss /= size
    correct /= size
    print(f"Test Error: \n Accuracy: {(100*correct):>0.1f}%, Avg loss: {test_loss:>8f} \n")


model.train()
start = time.time()
for t in range(epochs):
    print(f"Epoch {t+1}\n-------------------------------")
    train_loop(trainloader, model, loss_fn, optimizer)
    test_loop(valloader, model, loss_fn)
end = time.time()
print(f"total training time in minutes: {(end-start)/60}")
```
Output
```
Epoch 1
-------------------------------
loss: 0.693147 [ 0/ 600]
loss: 0.693147 [ 100/ 600]
loss: 0.693147 [ 200/ 600]
loss: 0.693147 [ 300/ 600]
loss: 0.693147 [ 400/ 600]
loss: 0.693147 [ 500/ 600]
Epoch 2
-------------------------------
loss: 0.693147 [ 0/ 600]
loss: 0.693147 [ 100/ 600]
loss: 0.693147 [ 200/ 600]
loss: 0.693147 [ 300/ 600]
loss: 0.693147 [ 400/ 600]
loss: 0.693147 [ 500/ 600]
Epoch 3
-------------------------------
loss: 0.693147 [ 0/ 600]
loss: 0.693147 [ 100/ 600]
loss: 0.693147 [ 200/ 600]
loss: 0.693147 [ 300/ 600]
loss: 0.693147 [ 400/ 600]
loss: 0.693147 [ 500/ 600]
Epoch 4
-------------------------------
loss: 0.693147 [ 0/ 600]
loss: 0.693147 [ 100/ 600]
loss: 0.693147 [ 200/ 600]
loss: 0.693147 [ 300/ 600]
loss: 0.693147 [ 400/ 600]
loss: 0.693147 [ 500/ 600]
Epoch 5
-------------------------------
loss: 0.693147 [ 0/ 600]
loss: 0.693147 [ 100/ 600]
loss: 0.693147 [ 200/ 600]
loss: 0.693147 [ 300/ 600]
loss: 0.693147 [ 400/ 600]
loss: 0.693147 [ 500/ 600]
total training time in minutes: 0.011684314409891764
```
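One thing I noticed about this run: 0.693147 is exactly ln(2), which is the binary cross-entropy of a constant predicted probability of 0.5, regardless of the label:

```python
import math
print(math.log(2))  # 0.6931471805599453 -- BCE of p = 0.5 for both y = 0 and y = 1
```

So the network appears to predict 0.5 for every sample from the very first batch.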
The model does not always output exactly these same loss values, but the pattern, identical per-batch losses repeated in every epoch, never changes. For example, the output from another run looked like this:
```
Epoch 1
-------------------------------
loss: 1.026536 [ 0/ 600]
loss: 0.980619 [ 100/ 600]
loss: 0.964926 [ 200/ 600]
loss: 1.120505 [ 300/ 600]
loss: 1.092735 [ 400/ 600]
loss: 0.910104 [ 500/ 600]
Test Error:
 Accuracy: 0.0%, Avg loss: 0.010968
Epoch 2
-------------------------------
loss: 1.026536 [ 0/ 600]
loss: 0.980619 [ 100/ 600]
loss: 0.964926 [ 200/ 600]
loss: 1.120505 [ 300/ 600]
loss: 1.092735 [ 400/ 600]
loss: 0.910104 [ 500/ 600]
Test Error:
 Accuracy: 0.0%, Avg loss: 0.010968
Epoch 3
-------------------------------
loss: 1.026536 [ 0/ 600]
loss: 0.980619 [ 100/ 600]
loss: 0.964926 [ 200/ 600]
loss: 1.120505 [ 300/ 600]
loss: 1.092735 [ 400/ 600]
loss: 0.910104 [ 500/ 600]
Test Error:
 Accuracy: 0.0%, Avg loss: 0.010968
```
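While staring at this output I also noticed that the accuracy metric itself may be meaningless here: argmax(1) over a [N, 1] output always returns index 0, which would explain the constant 0.0%:

```python
pred = torch.randn(5, 1)
print(pred.argmax(1))  # tensor([0, 0, 0, 0, 0]) -- only one column, so argmax is always 0
```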
Thanks for your replies; any help is very much appreciated!