Validation loss and training loss are inconsistent

I am trying to train an autoencoder for a regression task.
The network is simple: two linear layers with batch normalization in the encoder stage, and a single linear layer with no activation at the output.

I am using L1 loss, and during training I obtained the following loss log, which seems OK given the nature of my data:

Device: cpu
Epoch: 000/005 | Batch 0000/0013 | Cost: 70.2134
Epoch: 000/005 | Batch 0001/0013 | Cost: 69.9557
Epoch: 000/005 | Batch 0002/0013 | Cost: 69.7368
Epoch: 000/005 | Batch 0003/0013 | Cost: 69.4878
Epoch: 000/005 | Batch 0004/0013 | Cost: 69.2337
Epoch: 000/005 | Batch 0005/0013 | Cost: 68.9682
Epoch: 000/005 | Batch 0006/0013 | Cost: 68.7011
Epoch: 000/005 | Batch 0007/0013 | Cost: 68.3944
Epoch: 000/005 | Batch 0008/0013 | Cost: 68.1173
Epoch: 000/005 | Batch 0009/0013 | Cost: 67.8054
Epoch: 000/005 | Batch 0010/0013 | Cost: 67.4841
Epoch: 000/005 | Batch 0011/0013 | Cost: 67.1984
Epoch: 000/005 | Batch 0012/0013 | Cost: 69.7803
Epoch: 001/005 | Batch 0000/0013 | Cost: 66.5010
Epoch: 001/005 | Batch 0001/0013 | Cost: 66.2217
Epoch: 001/005 | Batch 0002/0013 | Cost: 65.9123
Epoch: 001/005 | Batch 0003/0013 | Cost: 65.5267
Epoch: 001/005 | Batch 0004/0013 | Cost: 65.2258
Epoch: 001/005 | Batch 0005/0013 | Cost: 64.8399
Epoch: 001/005 | Batch 0006/0013 | Cost: 64.4806
Epoch: 001/005 | Batch 0007/0013 | Cost: 64.0471
Epoch: 001/005 | Batch 0008/0013 | Cost: 63.7403
Epoch: 001/005 | Batch 0009/0013 | Cost: 63.2445
Epoch: 001/005 | Batch 0010/0013 | Cost: 62.8701
Epoch: 001/005 | Batch 0011/0013 | Cost: 62.4358
Epoch: 001/005 | Batch 0012/0013 | Cost: 64.3226
Epoch: 002/005 | Batch 0000/0013 | Cost: 61.6072
Epoch: 002/005 | Batch 0001/0013 | Cost: 61.1589
Epoch: 002/005 | Batch 0002/0013 | Cost: 60.6826
Epoch: 002/005 | Batch 0003/0013 | Cost: 60.1973
Epoch: 002/005 | Batch 0004/0013 | Cost: 59.7261
Epoch: 002/005 | Batch 0005/0013 | Cost: 59.2015
Epoch: 002/005 | Batch 0006/0013 | Cost: 58.7691
Epoch: 002/005 | Batch 0007/0013 | Cost: 58.2427
Epoch: 002/005 | Batch 0008/0013 | Cost: 57.7692
Epoch: 002/005 | Batch 0009/0013 | Cost: 57.2405
Epoch: 002/005 | Batch 0010/0013 | Cost: 56.5343
Epoch: 002/005 | Batch 0011/0013 | Cost: 56.0606
Epoch: 002/005 | Batch 0012/0013 | Cost: 62.1784
Epoch: 003/005 | Batch 0000/0013 | Cost: 54.8221
Epoch: 003/005 | Batch 0001/0013 | Cost: 54.2681
Epoch: 003/005 | Batch 0002/0013 | Cost: 53.5823
Epoch: 003/005 | Batch 0003/0013 | Cost: 53.0807
Epoch: 003/005 | Batch 0004/0013 | Cost: 52.3433
Epoch: 003/005 | Batch 0005/0013 | Cost: 51.6870
Epoch: 003/005 | Batch 0006/0013 | Cost: 51.1865
Epoch: 003/005 | Batch 0007/0013 | Cost: 50.5368
Epoch: 003/005 | Batch 0008/0013 | Cost: 49.5710
Epoch: 003/005 | Batch 0009/0013 | Cost: 49.1832
Epoch: 003/005 | Batch 0010/0013 | Cost: 48.4606
Epoch: 003/005 | Batch 0011/0013 | Cost: 47.7079
Epoch: 003/005 | Batch 0012/0013 | Cost: 60.0171
Epoch: 004/005 | Batch 0000/0013 | Cost: 46.1114
Epoch: 004/005 | Batch 0001/0013 | Cost: 45.3702
Epoch: 004/005 | Batch 0002/0013 | Cost: 44.6226
Epoch: 004/005 | Batch 0003/0013 | Cost: 43.9987
Epoch: 004/005 | Batch 0004/0013 | Cost: 43.0381
Epoch: 004/005 | Batch 0005/0013 | Cost: 42.2448
Epoch: 004/005 | Batch 0006/0013 | Cost: 41.2923
Epoch: 004/005 | Batch 0007/0013 | Cost: 40.5142
Epoch: 004/005 | Batch 0008/0013 | Cost: 39.8084
Epoch: 004/005 | Batch 0009/0013 | Cost: 38.8818
Epoch: 004/005 | Batch 0010/0013 | Cost: 38.1623
Epoch: 004/005 | Batch 0011/0013 | Cost: 36.9880
Epoch: 004/005 | Batch 0012/0013 | Cost: 51.9905

However, when I try to obtain the validation loss, it ends up in the range of thousands. I am not sure why this is happening; I am using L1 loss for both training and validation.

Device: cpu
Epoch: 000/005 | Batch 0000/0013 | Cost: 70.2134
Epoch: 000/005 | Batch 0001/0013 | Cost: 69.9557
Epoch: 000/005 | Batch 0002/0013 | Cost: 69.7368
Epoch: 000/005 | Batch 0003/0013 | Cost: 69.4878
Epoch: 000/005 | Batch 0004/0013 | Cost: 69.2337
Epoch: 000/005 | Batch 0005/0013 | Cost: 68.9682
Epoch: 000/005 | Batch 0006/0013 | Cost: 68.7011
Epoch: 000/005 | Batch 0007/0013 | Cost: 68.3944
Epoch: 000/005 | Batch 0008/0013 | Cost: 68.1173
Epoch: 000/005 | Batch 0009/0013 | Cost: 67.8054
Epoch: 000/005 | Batch 0010/0013 | Cost: 67.4841
Epoch: 000/005 | Batch 0011/0013 | Cost: 67.1984
Epoch: 000/005 | Batch 0012/0013 | Cost: 69.7803
Validation Loss Decreased(inf--->220074.531250) 	 Saving The Model
Epoch: 001/005 | Batch 0000/0013 | Cost: 220074.5469
Epoch: 001/005 | Batch 0001/0013 | Cost: 66429.5312
Epoch: 001/005 | Batch 0002/0013 | Cost: 5347.3374
Epoch: 001/005 | Batch 0003/0013 | Cost: 17270.3848
Epoch: 001/005 | Batch 0004/0013 | Cost: 19462.8301
Epoch: 001/005 | Batch 0005/0013 | Cost: 19283.0410
Epoch: 001/005 | Batch 0006/0013 | Cost: 17071.7578
Epoch: 001/005 | Batch 0007/0013 | Cost: 5900.5474
Epoch: 001/005 | Batch 0008/0013 | Cost: 27269.3164
Epoch: 001/005 | Batch 0009/0013 | Cost: 4085.6606
Epoch: 001/005 | Batch 0010/0013 | Cost: 4671.1963
Epoch: 001/005 | Batch 0011/0013 | Cost: 67094.2422
Epoch: 001/005 | Batch 0012/0013 | Cost: 19397.0020
Validation Loss Decreased(220074.531250--->21512.488281) 	 Saving The Model
Epoch: 002/005 | Batch 0000/0013 | Cost: 21512.4863
Epoch: 002/005 | Batch 0001/0013 | Cost: 23520.4238
Epoch: 002/005 | Batch 0002/0013 | Cost: 24536.1270
Epoch: 002/005 | Batch 0003/0013 | Cost: 16122.1826
Epoch: 002/005 | Batch 0004/0013 | Cost: 18098.9043
Epoch: 002/005 | Batch 0005/0013 | Cost: 15330.4785
Epoch: 002/005 | Batch 0006/0013 | Cost: 10604.5000
Epoch: 002/005 | Batch 0007/0013 | Cost: 10036.7793
Epoch: 002/005 | Batch 0008/0013 | Cost: 9192.8408
Epoch: 002/005 | Batch 0009/0013 | Cost: 5178.3882
Epoch: 002/005 | Batch 0010/0013 | Cost: 7630.6143
Epoch: 002/005 | Batch 0011/0013 | Cost: 4561.5088
Epoch: 002/005 | Batch 0012/0013 | Cost: 6063.7090
Validation Loss Decreased(21512.488281--->5502.875977) 	 Saving The Model
Epoch: 003/005 | Batch 0000/0013 | Cost: 5502.8760
Epoch: 003/005 | Batch 0001/0013 | Cost: 5709.9111
Epoch: 003/005 | Batch 0002/0013 | Cost: 6867.4663
Epoch: 003/005 | Batch 0003/0013 | Cost: 5088.6343
Epoch: 003/005 | Batch 0004/0013 | Cost: 5191.8086
Epoch: 003/005 | Batch 0005/0013 | Cost: 4265.2114
Epoch: 003/005 | Batch 0006/0013 | Cost: 4393.9121
Epoch: 003/005 | Batch 0007/0013 | Cost: 4127.4048
Epoch: 003/005 | Batch 0008/0013 | Cost: 3400.7935
Epoch: 003/005 | Batch 0009/0013 | Cost: 3129.0066
Epoch: 003/005 | Batch 0010/0013 | Cost: 3357.7268
Epoch: 003/005 | Batch 0011/0013 | Cost: 3299.5447
Epoch: 003/005 | Batch 0012/0013 | Cost: 2972.0579
Validation Loss Decreased(5502.875977--->2440.738770) 	 Saving The Model
Epoch: 004/005 | Batch 0000/0013 | Cost: 2440.7390
Epoch: 004/005 | Batch 0001/0013 | Cost: 2696.4646
Epoch: 004/005 | Batch 0002/0013 | Cost: 2442.3899
Epoch: 004/005 | Batch 0003/0013 | Cost: 2186.0679
Epoch: 004/005 | Batch 0004/0013 | Cost: 2166.6067
Epoch: 004/005 | Batch 0005/0013 | Cost: 2227.7563
Epoch: 004/005 | Batch 0006/0013 | Cost: 2022.2296
Epoch: 004/005 | Batch 0007/0013 | Cost: 1790.8546
Epoch: 004/005 | Batch 0008/0013 | Cost: 1893.8618
Epoch: 004/005 | Batch 0009/0013 | Cost: 5180.7158
Epoch: 004/005 | Batch 0010/0013 | Cost: 1984.6993
Epoch: 004/005 | Batch 0011/0013 | Cost: 2232.0305
Epoch: 004/005 | Batch 0012/0013 | Cost: 2158.0151
Validation Loss Decreased(2440.738770--->1890.082397) 	 Saving The Model

I was wondering, is there something wrong with my training code shown below?

optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate)
for epoch in range(num_epochs):
    for batch_idx, (x, y) in enumerate(train_loader):
        features = func(x, y)
        features.to(device)
        # Forward and backward prop
        decoded = model(features.float())
        cost = F.l1_loss(decoded, features.float())
        optimizer.zero_grad()
        cost.backward()
        # Update model parameters
        optimizer.step()
        # Logging
        print('Epoch: %03d/%03d | Batch %04d/%04d | Cost: %.4f'
              % (epoch, num_epochs, batch_idx,
                 len(train_loader), cost))
    model.eval()
    for batch_idx, (x, y) in enumerate(val_loader):
        features_val = func(x, y)
        features.to(device)
        # Decode
        decoded = model(features.float())
        val_cost = F.l1_loss(decoded, features.float())
        if min_val_loss > val_cost:
            print(f'Validation Loss Decreased({min_val_loss:.6f}--->{val_cost:.6f}) \t Saving The Model')
            min_val_loss = val_cost

Some insights about the data:
I have a 256x256 image and 17,000 npy files containing [x, y, z] coordinates.
I am trying to flatten the image into a 1D vector, append the x, y, z coordinates at the end of it, and then use that as my input to the model.
The objective would then be to reconstruct a new feature representation whose last 3 columns give me new values for the x, y and z coordinates.
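
For context, this is roughly the idea behind the func(x, y) call in the training code above, shown for a single sample (the helper name and the exact shapes here are assumptions, not my actual implementation):

import torch

def build_features(image, xyz):
    # Sketch: flatten a 256x256 image and append the [x, y, z] coordinates.
    # Assumes image is a (256, 256) tensor and xyz is a length-3 tensor.
    flat = image.reshape(-1).float()              # (65536,)
    return torch.cat([flat, xyz.float()], dim=0)  # (65539,) = 256*256 + 3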

  1. Is there any dropout function inside the model?
    If yes, please check the p (the dropout probability).

  2. The model is put into eval mode after the first training epoch and never switched back to train mode. You should call model.train(), as in

for epoch in range(num_epochs):
    model.train()
    ~~~

Yes, there is a dropout. The first layer is a linear layer with a weight matrix of size num_features*num_hidden_dimension, and since num_features is very large (almost 65,000) I decided to use a dropout after the first layer.

Also, I replaced model.eval() with "with torch.set_grad_enabled(False):", and that did the trick.
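
For reference, this is roughly what the loop looks like now (just a sketch; it reuses model, optimizer, func, the loaders and min_val_loss from the code above):

for epoch in range(num_epochs):
    model.train()
    for batch_idx, (x, y) in enumerate(train_loader):
        features = func(x, y).float().to(device)
        decoded = model(features)
        cost = F.l1_loss(decoded, features)
        optimizer.zero_grad()
        cost.backward()
        optimizer.step()

    # validation: gradients disabled during evaluation
    with torch.set_grad_enabled(False):
        for batch_idx, (x, y) in enumerate(val_loader):
            features_val = func(x, y).float().to(device)
            decoded = model(features_val)
            val_cost = F.l1_loss(decoded, features_val)
            if min_val_loss > val_cost:
                min_val_loss = val_cost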

I have some other questions, if you don’t mind.

At the end of epoch 4 the train loss is 32.7 and the val loss comes out to be 21; similarly, at the end of epoch 5 the train loss is 8.07 and the val loss is approximately 4. So it seems that the model is underfitting here, right?

Also, once I train the autoencoder, how can I use the model for regression purposes? The objective here is to use the model to predict x, y and z coordinates similar to the ones defined in the npy files.
I am thinking that for a new image (256x256), I flatten it to a 1D array and then append a 3x1 vector set to zeros. I will then feed this to the model, which can then predict a new representation.
The model is as follows

class AutoEncoder(torch.nn.Module):
    
    def __init__(self,num_features,num_hidden_1,num_hidden_2,num_hidden_3=256):
        super(AutoEncoder,self).__init__()
        #Encoder
        self.linear_1=nn.Linear(num_features,num_hidden_1)
        self.bn_1=nn.BatchNorm1d(num_hidden_1)
        self.linear_11=nn.Linear(num_hidden_1,num_hidden_2)
        self.bn_2=nn.BatchNorm1d(num_hidden_2)
        #Decoder
        self.linear_21=nn.Linear(num_hidden_2,num_hidden_3)
        self.linear_22=nn.Linear(num_hidden_3,num_features)
        self.bn_3=nn.BatchNorm1d(num_hidden_3)
        self.drop=nn.Dropout(p=0.5)
    
    def forward(self,x):
        encoder=F.leaky_relu(self.bn_1(self.linear_1(x)))
        encoder=self.drop(encoder)
        encoder=F.leaky_relu(self.bn_2(self.linear_11(encoder)))
        decoded=F.leaky_relu(self.bn_3(self.linear_21(encoder)))
        decoded=self.linear_22(decoded)
        return decoded

load model
new feature vector of dimension [256x256x1+3,1]
new_representation=model(new_feature_vector)
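
Spelled out, something like this (just a sketch; the checkpoint filename, new_image and the zero padding for the coordinates are placeholders):

import torch

model.load_state_dict(torch.load('autoencoder.pt'))   # hypothetical checkpoint path
model.eval()

flat = new_image.reshape(-1).float()                   # (65536,) from the 256x256 image
padded = torch.cat([flat, torch.zeros(3)])             # zeros where x, y, z would go
with torch.no_grad():
    new_representation = model(padded.unsqueeze(0))    # add a batch dimension
predicted_xyz = new_representation[0, -3:]             # last 3 values = predicted coordinates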

Is that the correct way to do this?

Hi, I think an autoencoder architecture is not well suited to regression.

If you are not familiar with this, I recommend looking at classification architectures such as LeNet, VGG, ResNet, …
It is possible to modify the last layer nn.Linear(xxxx, num_classes) to nn.Linear(xxxx, 3)
and use a regression loss function.
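
For example, with a ResNet-18 from torchvision (just a sketch to show the idea; the input sizes and batch are dummy values):

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision import models

net = models.resnet18()                        # any classification backbone would do
net.fc = nn.Linear(net.fc.in_features, 3)      # replace the classifier head with 3 outputs

images = torch.randn(8, 3, 224, 224)           # dummy batch of images
targets = torch.randn(8, 3)                    # dummy x, y, z targets
pred = net(images)                             # shape (8, 3)
loss = F.mse_loss(pred, targets)               # a regression loss instead of cross-entropy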

Actually, that is what I wanted to use, but unfortunately my superiors don’t wish to use such models and are set on using an autoencoder.
I will try to convince them to use something similar to an autoregressive model like PixelCNN or MADE, but I don’t think even those models will work, because in the end these are generative models and the task is different.