Validation loss much lower than training loss from the get-go

Hi there,

I am training a basic VAE on tabular data (standardized integers, real numbers, binary values and vectorized categories) and whatever I do, my validation loss is always considerably lower than my training loss. They also do not seem to get closer to each other whatsoever. Even from epoch 1:

Epoch 1 complete! Average validation Loss: 14.653213500976562
Epoch 1 complete! Average training Loss: 52.207223892211914

Is there something I am overlooking? I have been racking my brain over this and do not know what to do. In case you were wondering, I am not using any dropout layers.

Thank you in advance!

My Training loop:

print("Start training VAE...")
epochs = 1000
loss_list = []
val_list = []
for epoch in range(epochs):
    overall_loss = 0
    val_overall_loss = 0
    with torch.no_grad():
        for features in valloader:
            xval = features
            xval_hat, valmean, vallog_var = model(xval)
            valloss = loss_function(xval, xval_hat, valmean, vallog_var)
            val_overall_loss += valloss.item()
    print("\tEpoch", epoch + 1, "complete!", "\tAverage validation Loss: ", val_overall_loss/128)

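    for features in trainloader:
        x = features
        optimizer.zero_grad()  # clear gradients left over from the previous batch

        x_hat, mean, log_var = model(x)
        loss = loss_function(x, x_hat, mean, log_var)
        overall_loss += loss.item()
        loss.backward()   # backpropagate
        optimizer.step()  # update the weights
    print("\tEpoch", epoch + 1, "complete!", "\tAverage training Loss: ", overall_loss/128)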


Do you have a sense of how low the validation loss can plausibly go, i.e. how hard this problem is? One (sad) possibility is that you’re simply overfitting to your training data and the model fails to generalize to the validation data.

(also a small side note, make sure the batch size used by both loaders is 128, since you have that hardcoded)

Hi, thanks for responding.

I did indeed hardcode the batch size, and both loaders should use the same batch size. As for the loss expectation, I am working on a project that applies a VAE to tabular data. Since VAEs are mostly used for image data, I have no clue what to expect. It could be that my dataset is too small (918 instances in total). Do you think it would be wiser to first generate a larger dataset myself to test my model?

I think that’s a great idea. If you generate some synthetic data that has a very clear structure, the model ought to be able to learn it, so it’s a good way to test that the model is set up properly. This will also give you a feel for how much data you need to train your model, depending on how strong the underlying structure is. No idea if 918 instances is enough, it depends on the signal/noise ratio in the data so there’s no general answer. Intuitively it feels low :slight_smile: but I can’t be sure.

So I generated a dataset of 50,000 instances. Two columns (features) are continuous and two are categorical; I converted each categorical column to an embedding of size 3. This brings the total number of columns/features to 8, so the dataset shape is 50000 x 8. I split it 40000/10000 into train/validation.
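
For anyone who wants to reproduce a dataset of this shape, here is a minimal sketch using only the standard library. The recipe is hypothetical (the exact one used in this thread was not posted), and it assumes plain one-hot vectors of size 3 in place of the embeddings mentioned above. `make_synthetic_rows` and its parameters are names made up for illustration.

```python
import random

def make_synthetic_rows(n_rows=50_000, n_categories=3, seed=0):
    """Build rows of 2 continuous + 2 one-hot categorical features (8 columns)."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        # Hypothetical structure: the first category shifts the mean of the
        # first continuous value, the second category scales the spread of
        # the second one.
        cat_a = rng.randrange(n_categories)
        cat_b = rng.randrange(n_categories)
        cont_a = rng.gauss(cat_a - 1.0, 0.1)
        cont_b = rng.gauss(0.0, 0.5 * (cat_b + 1))
        one_hot_a = [1.0 if i == cat_a else 0.0 for i in range(n_categories)]
        one_hot_b = [1.0 if i == cat_b else 0.0 for i in range(n_categories)]
        rows.append([cont_a, cont_b] + one_hot_a + one_hot_b)
    return rows

rows = make_synthetic_rows()
random.Random(1).shuffle(rows)  # shuffle before splitting
train_rows, val_rows = rows[:40_000], rows[40_000:]
print(len(train_rows), len(val_rows), len(rows[0]))
```

Because the continuous columns depend on the categories, a correctly wired model should have something learnable to pick up.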

As there are a lot of instances in this dataset, there should be a clear structure as the same recipe is repeated over and over. When I apply training, I still see the validation loss being much lower than the training loss. I get the feeling that it’s not about my data. Do you have any suggestions?

Epoch 1 complete! 	Average training Loss:  191.17870345711708
Epoch 1 complete! 	Average validation Loss:  42.93123149871826

I apologize, I misread your original post. I mistakenly thought your validation loss was the higher one, hence overfitting suspicions.

Do you have any dropout layers in your model? If so, it’s common to see lower validation loss, since dropout gets turned off during validation, leading to better performance.

I read about dropout being a cause for validation loss being consistently lower than training loss. However, I am not using any dropout layers. I am maintaining a simple VAE model as I want to test it with a simple architecture first. Here is my model:

import torch
import torch.nn as nn

class Encoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, latent_dim):
        super(Encoder, self).__init__()
        self.input1 = nn.Linear(input_dim, hidden_dim)
        self.input2 = nn.Linear(hidden_dim, hidden_dim)

        self.input_mean = nn.Linear(hidden_dim, latent_dim)
        self.input_var = nn.Linear(hidden_dim, latent_dim)

        self.relu = nn.ReLU()

    def forward(self, x):
        input = x
        h_ = self.relu(self.input1(input))
        h_ = self.input2(h_)
        mean = self.input_mean(h_)
        log_var = self.input_var(h_)

        return mean, log_var

class Decoder(nn.Module):
    def __init__(self, latent_dim, hidden_dim, output_dim):
        super(Decoder, self).__init__()
        self.output1 = nn.Linear(latent_dim, hidden_dim)
        self.output2 = nn.Linear(hidden_dim, hidden_dim)
        self.output3 = nn.Linear(hidden_dim, output_dim)

        self.relu = nn.ReLU()
    def forward(self, x):
        h = self.relu(self.output1(x))
        h = self.relu(self.output2(h))

        x_hat = self.output3(h)
        return x_hat

class Model(nn.Module):
    def __init__(self, Encoder, Decoder):
        super(Model, self).__init__()
        self.Encoder = Encoder
        self.Decoder = Decoder

    def reparameterization(self, mean, var):
        epsilon = torch.randn_like(var)
        z = mean + var*epsilon
        return z

    def forward(self, x):
        mean, log_var = self.Encoder(x)
        z = self.reparameterization(mean, torch.exp(0.5 * log_var))
        x_hat = self.Decoder(z)

        return x_hat, mean, log_var

encoder = Encoder(input_dim=8, hidden_dim=hidden_dim, latent_dim=latent_dim)
decoder = Decoder(latent_dim=latent_dim, hidden_dim = hidden_dim, output_dim = 8)

model = Model(Encoder=encoder, Decoder=decoder)

from torch.optim import Adam

MSE_loss = nn.MSELoss()

def loss_function(x, x_hat, mean, log_var):
    reproduction_loss = nn.functional.mse_loss(x_hat, x, reduction='sum')
    KLD = -0.5 * torch.sum(1 + log_var - mean.pow(2) - log_var.exp())

    return reproduction_loss + KLD

optimizer = Adam(model.parameters(), lr=lr)

I appreciate your help

Hmm, aren’t you just adding more elements to the loss in your training loop than in your validation loop, hence the (proportionally) bigger loss? You are indeed dividing by the batch size, but the training loop has 4x more batches than the validation loop, so won’t the sum be ~4x bigger?
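
To put rough numbers on this, here is a small sketch that rescales the two epoch-1 values posted above into per-sample losses. It assumes that each batch loss is a sum over the samples in the batch (which is what `reduction='sum'` and the `torch.sum` in the KLD term give) and that the split is 40,000/10,000 as stated. `per_sample_loss` is a helper name made up for this illustration.

```python
BATCH_SIZE = 128  # the constant hardcoded in the training loop

def per_sample_loss(logged_value, n_samples, divisor=BATCH_SIZE):
    """Undo the division by `divisor`, then divide by the number of samples."""
    summed_loss = logged_value * divisor
    return summed_loss / n_samples

train_logged = 191.17870345711708  # epoch 1, 40,000 training rows
val_logged = 42.93123149871826     # epoch 1, 10,000 validation rows

train_per_sample = per_sample_loss(train_logged, 40_000)
val_per_sample = per_sample_loss(val_logged, 10_000)
print(f"train: {train_per_sample:.3f}  validation: {val_per_sample:.3f}")
```

On this scale the two come out at roughly 0.61 and 0.55, so most of the apparent gap would be explained by the number of batches. They are not expected to match exactly, since the training figure is accumulated while the weights are still changing.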

This is something I had not considered before… Thank you for the insight, first of all. My thinking was this: the training and validation for-loops each take one batch at a time. Each batch is the same size in both training and validation (128, for example), and each loss average is computed per batch (loss / batch size). Does that not make the training code correct as I wrote it?

For simplicity, suppose that your batch size is 1, and that you have 10 data points in your training set, and 1 data point in your validation set. Suppose that the loss is identical across all the data points, equal to 0.25 (to pick an arbitrary number).

Per your construction above, your overall training loss will be 2.5 (= 10 * 0.25) and your validation loss will be 0.25 (= 1 * 0.25) so your training loss will be bigger than your validation loss in exact proportion to the ratio of dataset sizes.

For this reason it’s customary to average the loss over the (training / validation / test) data set, rather than to sum it.
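
The toy case above can be checked in a few lines of plain Python. This is only a sketch of the arithmetic, with batch size 1 and a constant loss of 0.25 per data point as assumed in the example; the function names are made up here.

```python
def summed_loss(per_point_losses):
    """What a loop that only accumulates reports: a plain sum."""
    return sum(per_point_losses)

def mean_loss(per_point_losses):
    """Average over the data set, independent of its size."""
    return sum(per_point_losses) / len(per_point_losses)

train_points = [0.25] * 10  # 10 training points, each with loss 0.25
val_points = [0.25] * 1     # 1 validation point with the same loss

print(summed_loss(train_points), summed_loss(val_points))  # 2.5 0.25
print(mean_loss(train_points), mean_loss(val_points))      # 0.25 0.25
```

The sums differ by the ratio of the data set sizes, while the means agree.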

Let me know if you see something wrong with that argument!

My goodness! It is only now that I realize every epoch of training contains 4 times as many batches as the validation set does. Thank you for clearing this up for me, what a stupid mistake. I spent too many hours racking my brain over this. Would a proper way to deal with this be to simply divide the training loss by 4, since the training data is exactly 4 times as large as the validation set (80/20 ratio)? Or is there a better way to formulate the training loop? Again, thank you.

Haha, no worries.

My approach has been to, every epoch, store all the losses in a list (so you have one list for training and another list for validation) and, before moving to the next epoch, just compute the average over each list and use that as the loss. That way I don’t have a hardcoded number hiding in my code (like dividing by 4, or dividing by 128) which may do the wrong thing in the future if the size of the dataset changes, or I decide to reuse the same code somewhere else and forget to adapt that number.

Something like:

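import numpy as np

for epoch in range(epochs):
    train_losses, valid_losses = [], []
    for features in trainloader:
        # forward pass, loss.backward() and optimizer.step() as before, giving `loss`
        train_losses.append(loss.item())
    for features in validloader:
        # forward pass only, under torch.no_grad(), giving `valloss`
        valid_losses.append(valloss.item())
    train_loss_avg = np.mean(train_losses)  # mean over batches
    valid_loss_avg = np.mean(valid_losses)
    print(train_loss_avg, valid_loss_avg)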

But I’m sure you can come up with other ways, just have to make sure you don’t have a proportion issue.
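
One such other way, as a rough sketch: since the loss function in this thread sums over the samples in a batch, you can accumulate the summed loss together with the number of samples seen and divide at the end of the epoch. That gives a per-sample average that does not depend on the batch size or on a smaller final batch. `RunningAverage` is a made-up helper, not part of PyTorch, and the batch sizes below are simulated.

```python
class RunningAverage:
    """Accumulate summed batch losses and the number of samples they cover."""

    def __init__(self):
        self.total = 0.0
        self.count = 0

    def update(self, batch_loss_sum, batch_size):
        self.total += batch_loss_sum
        self.count += batch_size

    @property
    def average(self):
        # NaN if nothing has been recorded yet
        return self.total / self.count if self.count else float("nan")

# Simulated epoch: 300 samples in batches of 128, 128 and 44,
# each sample contributing a loss of 0.5 to its batch sum.
tracker = RunningAverage()
for batch_size in (128, 128, 44):
    tracker.update(batch_loss_sum=0.5 * batch_size, batch_size=batch_size)
print(tracker.average)
```

In the loops from this thread the call would look something like `tracker.update(loss.item(), len(features))`, assuming `features` is a tensor with the batch dimension first. Use one tracker for training and one for validation, and create fresh ones each epoch.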

Very straightforward and good solution. Thank you, I was tunnel-visioned for a long time. I can now experiment with the VAE :smile: