Validation loss lower than training loss

Hi,

train_df.shape
(7111, 8)

I have the following predictor variables

   deg_C  relative_humidity  absolute_humidity  sensor_1  sensor_2  sensor_3  sensor_4  sensor_5
0   13.1               46.0             0.7578    1387.2    1087.8    1056.0    1742.8    1293.4
1   13.2               45.3             0.7255    1279.1     888.2    1197.5    1449.9    1010.9

and here are the target variables

   target_carbon_monoxide  target_benzene  target_nitrogen_oxides
0                     2.5            12.0                   167.7
1                     2.1             9.9                    98.9

My approach is:

Scaling both predictors and targets

import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X_train, X_val, y_train, y_val = train_test_split(train_df, train_y, test_size=0.2, random_state=0)

# fit scalers on the training split only; cast to float32 to match the model's dtype
scalerX = StandardScaler().fit(X_train)
scalery = StandardScaler().fit(y_train)

X_train = scalerX.transform(X_train).astype(np.float32)
y_train = scalery.transform(y_train).astype(np.float32)
X_val = scalerX.transform(X_val).astype(np.float32)
y_val = scalery.transform(y_val).astype(np.float32)
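
Since the targets are scaled as well, anything the model predicts will be in scaled units; scalery.inverse_transform maps values back to the original units. A minimal sketch:

# map scaled targets (or, later, scaled predictions) back to the original units
y_val_original = scalery.inverse_transform(y_val)
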
class ClassifierDataset(Dataset):
    # simple wrapper so DataLoader can iterate over (features, target) pairs

    def __init__(self, X_data, y_data):
        self.X_data = X_data
        self.y_data = y_data

    def __getitem__(self, index):
        return self.X_data[index], self.y_data[index]

    def __len__(self):
        return len(self.X_data)
NUM_FEATURES     = 8
BATCHSIZE        = 500


train_dataset = ClassifierDataset(torch.from_numpy(X_train), torch.from_numpy(y_train))
train_loader  = DataLoader(dataset=train_dataset, batch_size=BATCHSIZE)

valid_dataset = ClassifierDataset(torch.from_numpy(X_val), torch.from_numpy(y_val))
valid_loader  = DataLoader(dataset=valid_dataset, batch_size=BATCHSIZE)

Model

class multiNet(nn.Module):
    def __init__(self, num_feature):
        super().__init__()

        self.lin0 = nn.Linear(num_feature,5)
        self.lin1 = nn.Linear(5, 4)
        self.lin2 = nn.Linear(4, 3)
        
        self.bn0 = nn.BatchNorm1d(num_feature)
        self.bn1 = nn.BatchNorm1d(5)
        self.bn2 = nn.BatchNorm1d(4)  # note: defined but never used in forward()

    def forward(self, x):
        
        x = self.bn0(x)
        x = self.lin0(x) 
        x = F.relu(x)
        x = self.bn1(x)
        
        x = self.lin1(x) 
        x = F.relu(x)

        x = self.lin2(x) # output layer

        return x

Optimizer

def get_optimizer(model, lr):
    optimizer = torch.optim.Adam(model.parameters(), lr=lr, weight_decay=0.01)
    # note: ReduceLROnPlateau only acts when scheduler.step(metric) is called,
    # and threshold=1e1 (i.e. 10) is an unusually large plateau threshold
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, patience=10, factor=0.1, threshold=1e1)

    return optimizer, scheduler

Weights

def init_weights(m):
    if type(m) == nn.Linear:
        # He initialization: std = sqrt(2 / fan_in), following the Andrew Ng course,
        # e.g. torch.ones([10, 6]) * torch.sqrt(torch.Tensor([2.]) / torch.Tensor([1.4142]))
        m.weight.data = torch.randn(m.weight.size()) * torch.sqrt(torch.Tensor([2.]) / m.weight.size()[1])

Model Training

seed = 4
torch.manual_seed(seed)

def train_loop(model, epochs, lr):
    criterion = nn.L1Loss()
    optim, scheduler = get_optimizer(model, lr=lr)

    train_loss_values = []
    val_loss_values   = []

    for epoch in range(epochs):
        train_running_loss = 0.0
        val_running_loss   = 0.0

        model.train()
        for data, target in train_loader:
            output = model(data)
            loss = torch.sqrt(criterion(output, target)) # as I need RMSLE
            optim.zero_grad()
            loss.backward()
            optim.step()
            train_running_loss += loss.item()  # was "=+", which overwrote instead of accumulating

            # ReduceLROnPlateau would need scheduler.step(val_loss) once per
            # epoch, so with the scheduler above this branch never fires
            if not isinstance(scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
                scheduler.step()

        train_loss_values.append(train_running_loss / len(train_loader))  # average per-batch loss

        model.eval()
        with torch.no_grad():  # no gradients needed for validation
            for valid_data, valid_target in valid_loader:
                valid_output = model(valid_data)
                valid_loss = criterion(valid_output, valid_target)
                val_running_loss += valid_loss.item()
        val_loss_values.append(val_running_loss / len(valid_loader))

        if epoch % 100 == 0:
            print(f'epoch : {epoch+1},training loss : {loss} , validation loss : {valid_loss}')
    plt.plot(train_loss_values, 'r')
    plt.plot(val_loss_values, 'g')

NUM_FEATURES = X_train.shape[1]
multiNetModel = multiNet(NUM_FEATURES)
multiNetModel.apply(init_weights) 
train_loop(multiNetModel, epochs=500, lr=0.001)

Is my baseline approach above correct? If it is, then why is my validation loss so much lower than my training loss?

epoch : 1 ,training loss : 1.109604 , validation loss : 0.995102
epoch : 101,training loss : 0.608683 , validation loss : 0.294198
epoch : 201,training loss : 0.518566 , validation loss : 0.225262
epoch : 301,training loss : 0.514474 , validation loss : 0.221938
epoch : 401,training loss : 0.512781 , validation loss : 0.220481

Hi Mr. Oraware!

The bulk of the discrepancy is because you have sqrt() in your
training loss and not in your validation loss. (Even accounting for
this, your training loss is still perhaps about 20% greater than your
validation loss, but this is easier to hand-wave away with more
“hard” samples in your training set or normalization vagaries, etc.)
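
For concreteness, the consistent version would apply the same sqrt() to the validation loss (a sketch against your loop; criterion and the variable names are from your code):

valid_loss = torch.sqrt(criterion(valid_output, valid_target))  # match the training loss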

Best.

K. Frank


@KFrank thanks for the correction. Yes, I fixed it, and the gap between training and validation has now narrowed, but the validation loss is still lower than the training loss. Could you please elaborate on what you mean by

“but this is easier to hand-wave away with more
“hard” samples in your training set or normalization vagaries, etc.”

epoch : 1,training loss : 1.1062123730735536 , validation loss : 0.9584911298922756
epoch : 101,training loss : 0.5904753306208222 , validation loss : 0.5017905667378958
epoch : 201,training loss : 0.4886256533068133 , validation loss : 0.43560073343017525
epoch : 301,training loss : 0.4826799846960937 , validation loss : 0.43197416186192145
epoch : 401,training loss : 0.47715110642727154 , validation loss : 0.4264545226974467

Hi Mr. Oraware!

Some context:

As a general rule, you would expect (after some training) your training
loss to be smaller than your validation loss. This is because, when
training on your training data, your model “learns” not only the general
features of the problem you are trying to solve, but also some specific
features of your training set. That is, it “learns” (or perhaps “memorizes”)
that it should make certain specific predictions for certain specific
samples in your training set. So your model performs better than it
“should” on these training samples it has “memorized” and therefore
produces a smaller training loss. But, as you note, you are seeing
the opposite.

“Hard” samples: Consider digit classification. If written sloppily enough, a “1”
and a “7” can be hard to tell apart, as can a “4” and a “9.” Suppose that,
by happenstance, your training set has a lot of sloppy "1"s and "4"s, but
your validation set doesn’t. When you make predictions on your training
set, you’re asking the model to do something harder – correctly recognize
those sloppy "1"s – than you’re asking it to do with your validation set. So
you get a larger (worse) loss on your training set.

Normalization: You normalize both your training set and validation set
using statistics computed from your training set. (I think this is the right
thing to do.) But your loss criterion – L1Loss – scales with the size of the
target (and predictions). Let’s say that your training set has a “mean”
of 1.0, so that (in simplified terms) your normalization doesn’t actually
change anything, but, by happenstance, your validation set has a “mean”
of 0.9 – and this is not changed by normalizing it with the training-set
statistics. Now, all else being equal, your validation-set loss will be
smaller by a factor of 0.9 than your training-set loss, merely because
the values given to your L1Loss criterion are, themselves, smaller by
a factor of 0.9.
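
A quick numerical sketch of this scaling effect (illustrative values only, not from your actual data):

criterion = nn.L1Loss()

# identical 10% relative error on both sets, but validation targets are 0.9x smaller
train_target = torch.tensor([1.0, 2.0, 3.0])
train_pred   = train_target * 1.1
val_target   = train_target * 0.9
val_pred     = val_target * 1.1

print(criterion(train_pred, train_target))  # tensor(0.2000)
print(criterion(val_pred, val_target))      # tensor(0.1800) = 0.9 * 0.2000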

One comment here: From your graph, it appears that your training
loss starts out higher than your validation loss, even on the first batch
of each, which, if I understand your code, is before any optimizer steps
for your first training batch, but after one full epoch of training for your
first validation batch. So this is not an apples-to-apples comparison.

In order for your training and validation losses to be comparable,
you want your training and validation set to have the same overall
characteristics. Do you have a single larger dataset that you randomly
split into a training and validation set? Or do your training and validation
data come from two separate sources?

Try taking a single dataset (for example, combine together your current
training and validation sets if they come from different sources) and
randomly split that single dataset into training and validation a handful
of times and run your training on those multiple random choices for
your training and validation data. See if any systematic discrepancy
occurs across multiple choices of the dataset split.
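
A minimal sketch of that experiment, reusing your pipeline (the loop variable name here is hypothetical, and refitting the scalers and rebuilding the loaders is elided):

for split_seed in [0, 1, 2, 3, 4]:
    X_train, X_val, y_train, y_val = train_test_split(train_df, train_y, test_size=0.2, random_state=split_seed)
    # ... refit scalerX / scalery on the new training split and rebuild the loaders ...
    model = multiNet(NUM_FEATURES)
    model.apply(init_weights)
    train_loop(model, epochs=500, lr=0.001)  # compare final train vs. validation loss per split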

Lastly, I don’t have good intuition about how the differing behavior of
your BatchNorm1d layers in train() and eval() modes might affect
your losses.

To explore this you could, after each epoch of training, put your model
in eval() mode (and, of course, not call optim.step()) and calculate
your training and validation loss (on batches of the same size). Plot those
as a function of epoch and see if the discrepancy persists.
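
Something like this epoch-end check would do it (a sketch; eval_loss is a hypothetical helper, and criterion and the loaders are the ones from your code):

def eval_loss(model, loader, criterion):
    # evaluate with BatchNorm in eval() mode (running statistics) and no optimizer steps
    model.eval()
    total = 0.0
    with torch.no_grad():
        for data, target in loader:
            total += criterion(model(data), target).item()
    return total / len(loader)

# after each training epoch:
# train_eval = eval_loss(model, train_loader, criterion)
# val_eval   = eval_loss(model, valid_loader, criterion)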

Best.

K. Frank


@KFrank I am working on your suggestion; I will get back to you, please bear with me.