PyTorch NN loss does not decrease

Dear community,

I am working on my first ever NN for binary classification and I have been stuck for two weeks now. It performs very badly and my loss gets stuck at a high level. I have X_train.shape torch.Size([38201, 129, 39]) and y_train.shape torch.Size([4927929]). I intend to use 129 rows as a batch (these are the records per patient). 129 is the maximum number of rows; for patients with fewer rows I padded with zero rows. As the data is very class-imbalanced, I use nn.CrossEntropyLoss with weights (nn.BCELoss does not accept per-class weights).
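
For reference, the padding step was done roughly like this (a sketch; per_patient is an illustrative name for a list of per-patient tensors of shape [rows, 39]):

import torch
from torch.nn.utils.rnn import pad_sequence

# per_patient: one tensor per patient, each of shape [rows, 39] with rows <= 129
# pad_sequence appends zero rows up to the longest sequence (129)
X_train = pad_sequence(per_patient, batch_first=True)  # -> [38201, 129, 39]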

I very much hope that an experienced eye might spot some “silly mistakes” in what I do. Thank you in advance.

My network:

import torch
import torch.nn as nn
import torch.optim as optim
import numpy as np
from sklearn.metrics import average_precision_score

class Net(nn.Module):
    def __init__(self, input_size, hidden_size, output_size):
        super(Net, self).__init__()
        self.input_size = input_size
        self.hidden_size = hidden_size
        self.output_size = output_size

        self.activation = nn.Sigmoid()
        self.fc1 = nn.Linear(self.input_size, self.hidden_size)
    
        self.fc2 = nn.LSTM(self.hidden_size, self.output_size, num_layers=2)  # returns a tuple
    
    def forward(self, x):
        h = self.fc1(x)     # [*, input_size] -> [*, hidden_size]
        h, _ = self.fc2(h)  # LSTM returns (output, (h_n, c_n)); keep the output
        h = self.activation(h)
        return h

input_size = X_train.shape[2]        # number of features = 39
hidden_size = round(input_size / 2)  # half the number of features, as they are sparse
output_size = 2                      # 2 rather than 1, as the loss function requires this shape
batch_size = 129                     # rows (time steps) per patient

net = Net(input_size, hidden_size, output_size)

zeroes = (y_train == 0).sum().item()  # class counts, assuming y_train holds 0/1 labels
ones = (y_train == 1).sum().item()
weights = torch.tensor([zeroes / zeroes, zeroes / ones])  # weight class 1 up by the imbalance ratio
criterion = nn.CrossEntropyLoss(weight=weights)
optimizer = optim.Adam(net.parameters(), lr=0.0005)

TRAINING LOOP:

y_train = y_train.long()  # CrossEntropyLoss expects integer (long) class targets
y_val = y_val.long()

epochs = 15

train_losses, validation_losses = [], []

for epoch in range(epochs):  
    print ("\n Epoch [%d] out of %d" % (epoch + 1, epochs))

    running_loss = 0.0
    validation_loss = 0.0
    pr_auc = 0.0

    for phase in ['train', 'validation']:
        if phase == 'train':
            net.train()
        else:
            net.eval()
        if phase == 'train':
            start = 0
            start_y = 0
            for i in range(X_train.shape[0]):
                optimizer.zero_grad()  # zero the gradient buffers so gradients from the previous iteration are not accumulated
                end = start + 1
                X_batch = X_train[start:end]  # one patient: [1, 129, 39]
                start += 1
                # CrossEntropyLoss needs y in a different (flat) shape, hence the separate counter:
                end_y = start_y + batch_size
                y_batch = y_train[start_y:end_y]
                start_y += batch_size

                # forward + backward + optimize
                outputs = net(X_batch)
                outputs = outputs.view(batch_size, output_size)
                loss = criterion(outputs, y_batch)
                loss.backward()
                optimizer.step()  # does the update
                running_loss += loss.item()

        if phase == 'validation':
            with torch.no_grad():
                vx_start, vy_start = 0, 0
                for i in range(X_val.shape[0]):
                    vx_end = vx_start + 1
                    vX_batch = X_val[vx_start:vx_end]
                    vx_start += 1
                    # CrossEntropyLoss needs y in a different (flat) shape, hence the separate counter:
                    vy_end = vy_start + batch_size
                    vy_batch = y_val[vy_start:vy_end]
                    vy_start += batch_size

                    inputs, labels = vX_batch, vy_batch
                    v_output = net(inputs)
                    v_output = v_output.view(batch_size, output_size)
                    v_loss = criterion(v_output, labels)
                    validation_loss += v_loss.item()
                   
                
                print(f"Training loss: {running_loss/X_train.shape[0]:.3f}.. "
                      f"Validation loss: {validation_loss/X_val.shape[0]:.3f}.. ")
                if epoch%10==0:
                    out = net(X_val)
                    out = out.view(-1,2)
                    out = out[:,1]
                    out = np.where(out > 0.5, 1, 0)
                    pr_auc = average_precision_score(y_val,out)
                    print(f"PR AUC: {pr_auc:.3f} ")

                    #f"Test accuracy: {accuracy/len(testloader):.3f}")
                validation_losses.append(validation_loss/X_val.shape[0]) 
                train_losses.append(running_loss/X_train.shape[0])

OUTPUT

Epoch [1] out of 15
Training loss: 0.605… Validation loss: 0.596…
PR AUC: 0.177

Epoch [2] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [3] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [4] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [5] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [6] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [7] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [8] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [9] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [10] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [11] out of 15
Training loss: 0.596… Validation loss: 0.596…
PR AUC: 0.176

Epoch [12] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [13] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [14] out of 15
Training loss: 0.596… Validation loss: 0.596…

Epoch [15] out of 15
Training loss: 0.596… Validation loss: 0.596…
Finished Training
starttime = 2020-09-29 13:12:03.304423
now = 2020-09-29 13:26:43.455491

I’m not quite sure what the sequence dimension here is (if you use LSTMs). It would seem you are feeding 1-element sequences? In that case, a Linear layer might be all you need, but I feel I’m missing something.
The generic recommendation in a situation like this is to feed in the same batch over and over again and see whether the loss decreases (it should), i.e. to check that your network can overfit.
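
For example, something along these lines (an untested sketch, reusing the names from your code):

# sanity check: train on one fixed batch over and over; the loss should approach zero
X_one = X_train[0:1]           # one patient, shape [1, 129, 39]
y_one = y_train[0:batch_size]  # its 129 labels

for step in range(500):
    optimizer.zero_grad()
    out = net(X_one).view(batch_size, output_size)
    loss = criterion(out, y_one)
    loss.backward()
    optimizer.step()
    if step % 50 == 0:
        print(step, loss.item())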

Best regards

Thomas

Dear Thomas, thank you for your time looking into this. I have a table where patients have multiple rows of information representing measurements taken at different times (a time series). I thought it would make sense to feed one patient at a time (one batch is one patient).

I will now try what you recommend, to test whether my network can overfit (as it should), and will report back here with the result.

Best,
Alice

update:
Dear Thomas,

When I feed one batch, the training loss beautifully decreases to zero (as expected).

So, unfortunately, I am still clueless about where and what I am doing wrong…

Dear Alice,

I’m a bit dense and don’t understand the sizes in terms of patient, time, and features.
When you have X_batch = X_train[start:end], X_batch has shape 1 x 129 x 39, with 1 being the sequence length. Do you intend to use 129 as the sequence length? Then you would need to instantiate the LSTM with batch_first=True as a parameter.
But then the outputs would have shape 1 x 129 x 2 (I think) and you would need to take the average or so over time. I’m not sure I understand what the shape of y_batch is without knowing the batch size.

Best regards

Thomas

Edit: Maybe you could print the tensor shapes going into and out of the net and into the criterion to see.
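
Something quick and dirty inside the training loop, e.g. (sketch):

print("X_batch :", tuple(X_batch.shape))  # going into the net
outputs = net(X_batch)
print("outputs :", tuple(outputs.shape))  # coming out of the net
outputs = outputs.view(batch_size, output_size)
print("into criterion:", tuple(outputs.shape), "y:", tuple(y_batch.shape))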

Dear Thomas,

You are right (sorry for not being clear; as it is my first time, I am not 100% sure what I am doing or what things are called, and it is finally clear now what a sequence is): I have X_batch of shape 1 x 129 x 39 and I intend to use 129 as the sequence length. So I will add batch_first=True, as you advised, and try to make it work. Now I finally understand what this phrase in the documentation meant: batch_first – If True, then the input and output tensors are provided as (batch, seq, feature).
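
To check my understanding with a toy example (shapes only, not my real model):

lstm = nn.LSTM(input_size=39, hidden_size=2, num_layers=2, batch_first=True)
x = torch.randn(1, 129, 39)  # (batch, seq, feature)
out, _ = lstm(x)
print(out.shape)             # torch.Size([1, 129, 2])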

For y I have a label for each time stamp. So for an X_batch of shape 1 x 129 x 39, I have the 129 labels related to it.

At the moment I have set it running with batch_first=True and it has been stuck on epoch 1 for a while (so it got much slower, if it works at all). I will report the results here.

Best regards, Alice

update:
I only added batch_first=True and it works, but it takes much longer. I provide the output below (it got a bit better, but the loss still gets stuck).
Thank you for spotting this mistake.

Answering about the sizes:
X_batch shape is ([1, 129, 39])
y_batch is ([129])

In the line outputs = net(X_batch), the outputs size is torch.Size([1, 129, 2]);
therefore the next line, outputs = outputs.view(batch_size, output_size), reshapes it to torch.Size([129, 2]) so that it works for the loss function.
One column is for class 0 and the other for class 1 (their sum is 1). I understand that one column would be enough to predict, but the cross-entropy loss, which I use because I need to assign weights, requires this shape of outputs…
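
As a quick shape check with dummy tensors (illustration only):

dummy_out = torch.randn(1, 129, 2)     # shaped like the net output: (batch, seq, classes)
dummy_y = torch.randint(0, 2, (129,))  # one label per time step
loss = criterion(dummy_out.view(129, 2), dummy_y)  # CrossEntropyLoss wants [N, C] inputs and [N] targets
print(loss.item())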

So, after correcting the mistake that you spotted, I get the following output
(oddly, only 4 epochs seem to be enough for that bad training loss to stabilize):

Epoch [1] out of 5
Training loss: 0.591… Validation loss: 0.576…
PR AUC: 0.176

Epoch [2] out of 5
Training loss: 0.575… Validation loss: 0.573…

Epoch [3] out of 5
Training loss: 0.573… Validation loss: 0.572…

Epoch [4] out of 5
Training loss: 0.573… Validation loss: 0.571…

Epoch [5] out of 5
Training loss: 0.572… Validation loss: 0.571…
Finished Training
starttime = 2020-09-29 15:28:55.419550
now = 2020-09-29 18:00:04.417759

Maybe you can push up the learning rate a bit to see what happens.

Dear Tom,

Thank you, I will keep trying! With the hidden layer size, the number of LSTM layers, and the learning rate.

For me it is priceless that someone who understands “what's going on in this crazy NN world” read my code, spotted a BIG conceptual mistake about batching, and (as there were no other comments) confirmed that the rest in general resembles how it should look. Thank you!

Kind regards,

Alice
