Why is my validation loss in the 1000s?

Hello, I am currently trying to fix an issue where my validation loss is absurdly high compared to my train loss. I am training a 2-class classifier. The training part works fine; I know that because after the model is trained I test it on a third set (the test set) and the results are as they should be. But during training, when I compute the loss on the validation set, I get nonsensical results: the loss is over 1000, and it does not decrease after each epoch. What I think is that my validation code is not correct, but I don't know what is wrong. Here is my code:

import torch
import torch.nn as nn
import torch.optim as optim
from torch.optim.lr_scheduler import MultiplicativeLR


def train(model, train_loader, valid_loader, learning_rate, learning_rate_decay_rate, epochs, device, saved_model_filepath=None):
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    # multiply the learning rate by learning_rate_decay_rate after every epoch
    lr_lambda_1 = lambda epoch: learning_rate_decay_rate
    scheduler = MultiplicativeLR(optimizer, lr_lambda=lr_lambda_1)

    for i in range(epochs):
        total_loss_train = 0   # summed loss over all training batches this epoch
        total_loss_valid = 0   # summed loss over all validation batches this epoch
        valid_correct_preds = 0

        model.train()  # training mode (enables dropout / batch norm updates)

        for images, labels in train_loader:
            images = images.to(device)
            labels = labels.to(device)

            preds = model(images)

            optimizer.zero_grad()
            loss = criterion(preds, labels)
            loss.backward()
            optimizer.step()

            total_loss_train += loss.item()  # running sum of per-batch training losses

        model.eval()  # evaluation mode (disables dropout, uses running batch norm stats)

        with torch.no_grad():  # no gradients needed while validating
            for batch in valid_loader:
                images = batch[0].to(device)
                labels = batch[1].to(device)

                preds = model(images)
                loss = criterion(preds, labels)
                total_loss_valid += loss.item()  # running sum of per-batch validation losses

        scheduler.step()  # decay the learning rate once per epoch

        # note: these are losses summed over all batches in the epoch, not averages
        print(f'epoch: {i}, total_loss_train: {total_loss_train: .2f}')
        print(f'epoch: {i}, total_loss_valid: {total_loss_valid: .2f}')
        print()

The train loss is decreasing and the numbers look normal, but the validation loss is very high (usually somewhere between a few hundred and 1500) and it is not decreasing at all. What am I doing wrong?

Without additional information it's impossible to tell whether that loss value is to be expected in this situation. At first glance it seems like the issue could be that there is not enough signal in the (training) data for the model to generalize to the validation set, plus it's overfitting to the training set (hence the improvement in training loss).

I will update my question after this reply. As for the overfitting part, it is not happening to that extent. When I evaluate the results on the third set (the test set) after the model is already trained, the results are normal. I have printed the confusion matrix and the results look pretty normal. What I think is that somehow I am not doing the validation part correctly.

I think that somewhere in that part of the code something happens that produces these large loss values, even though my model is not overfitting.

I found something interesting which I don't understand. When I created my train and valid loaders, I set the train loader batch size to 16 and the valid loader batch size to 1. The total validation loss was about 10 times the train loss. Then I changed the valid batch size to 16 and the validation loss became about half of the train loss. Then I set the valid batch size to 32 and the validation loss became about 4 times smaller than the train loss.
I'm still confused about how this works. Any help will be appreciated.
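
To put some rough numbers on the pattern: the dataset sizes below are hypothetical, chosen only for illustration, and the loss per batch is treated as roughly constant (nn.CrossEntropyLoss averages over each batch by default, so the per-batch value does not grow with batch size).

import math

# Hypothetical sizes, not the real datasets: 2000 train samples, 1000 validation
# samples, and a roughly constant loss per batch.
n_train, n_valid = 2000, 1000
per_batch_loss = 0.5

train_batches = math.ceil(n_train / 16)    # train loader at batch size 16
print(f'train (bs=16): {train_batches} batches -> summed loss ~ {train_batches * per_batch_loss:.1f}')

for bs in (1, 16, 32):                     # validation loader at different batch sizes
    valid_batches = math.ceil(n_valid / bs)
    print(f'valid (bs={bs:2d}): {valid_batches} batches -> summed loss ~ {valid_batches * per_batch_loss:.1f}')

With those made-up numbers the summed validation loss goes from roughly 8 times the summed train loss (batch size 1) to about half of it (batch size 16) and about a quarter of it (batch size 32), which matches the pattern described above.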

Never mind, I understand it now. The problem (it is actually not a problem once you understand it) was that I was accumulating the total summed loss per epoch, and my train loader has about twice as many batches as my valid loader. So with both loaders at batch size 16, my summed train loss should on average (if the model is not overfitting) be about twice as big as my summed validation loss. The two numbers only become directly comparable if you divide each total by the number of batches in its loader.
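
A minimal sketch of the kind of change I mean (assuming both loaders wrap map-style datasets, so len(loader) gives the number of batches): report the mean loss per batch instead of the raw sum in the print statements at the end of each epoch.

        # mean loss per batch, comparable between loaders with different numbers of batches
        avg_loss_train = total_loss_train / len(train_loader)   # len(loader) == number of batches
        avg_loss_valid = total_loss_valid / len(valid_loader)

        print(f'epoch: {i}, avg_loss_train: {avg_loss_train:.4f}')
        print(f'epoch: {i}, avg_loss_valid: {avg_loss_valid:.4f}')

This way both numbers land on the same scale regardless of how many batches each loader happens to have.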