Why does batch size influence the loss?

pourya_farzi · September 8, 2020, 9:21am

Hi, I’m wondering about batch size, which clearly influences the epoch-loss graph. Suppose the following tensors are the model's predictions and the ground-truth values. The resulting loss changes depending on how the elements are distributed across the batches. (The values are random and not normalized.)

import torch
import torch.nn as nn

loss = nn.MSELoss()  # default reduction='mean': averages over the elements of each batch
predicted_batch_1 = torch.tensor([40., -1, -0.7, 1]).float()
ground_truth_batch_1 = torch.tensor([1.1, 180, 15, 1]).float()

predicted_batch_2 = torch.tensor([1, 89.5]).float()
ground_truth_batch_2 = torch.tensor([1, 14]).float()

output1 = loss(predicted_batch_1, ground_truth_batch_1)  # is 8630.1748
output2 = loss(predicted_batch_2, ground_truth_batch_2)  # is 2850.1250
total_loss = output1 + output2  # 11480.2998

If we concatenate the predicted tensors, and likewise the ground truths, the result is 6703.4917.
Here lies the problem: if we move one element of both the predictions and the ground truths from one batch to the other (as in the following snippet), the result is different. So we can’t simply use the mean of the total loss for plotting the epoch loss, since the mean for the batches above differs from the mean for the batches below.

predicted_batch_1 = torch.tensor([40., -1, -0.7, 89.5]).float()
ground_truth_batch_1 = torch.tensor([1.1, 180, 15, 14]).float()

predicted_batch_2 = torch.tensor([1, 1]).float()
ground_truth_batch_2 = torch.tensor([1, 1]).float()

output1 = loss(predicted_batch_1, ground_truth_batch_1)  # is 10055.2373
output2 = loss(predicted_batch_2, ground_truth_batch_2)  # is 0.0
total_loss = output1 + output2  # 10055.2373

I’m looking forward to your valuable comments.

tom · September 8, 2020, 9:49am

Well, here, you have one batch of size 4 and one of size 2, and MSELoss (with its default reduction='mean') averages over the items within each batch.
If you concatenate, you take MSELoss to average over all 6 items and it effectively computes the weighted average (4*8630+2*2850)/6 = 6703.
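To check that arithmetic by hand, here is a minimal sketch in plain Python using the numbers from the question. The helper `mse` is a name made up for this illustration (it is not a PyTorch function); it just mirrors what a mean-reduced MSE does on one batch.

```python
def mse(pred, target):
    """Mean squared error over one batch (hypothetical helper for illustration)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

# Values from the original question
pred_1, true_1 = [40.0, -1.0, -0.7, 1.0], [1.1, 180.0, 15.0, 1.0]
pred_2, true_2 = [1.0, 89.5], [1.0, 14.0]

loss_1 = mse(pred_1, true_1)  # mean over 4 items
loss_2 = mse(pred_2, true_2)  # mean over 2 items

# Weight each batch mean by its batch size before averaging
n_1, n_2 = len(pred_1), len(pred_2)
weighted = (n_1 * loss_1 + n_2 * loss_2) / (n_1 + n_2)

# MSE over all 6 items at once
concatenated = mse(pred_1 + pred_2, true_1 + true_2)

# Plain (unweighted) mean of the two batch losses, which depends on the split
unweighted = (loss_1 + loss_2) / 2

print(weighted, concatenated, unweighted)
```

The size-weighted average agrees with the loss over the concatenated tensors, while the unweighted mean of the two batch losses does not.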
When using a typical setup with PyTorch’s DataLoader, all batches except possibly the last have the same size (and there is the drop_last argument to just leave that last batch out), so the weights in the averaging would all be the same.
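If the batch sizes do differ, one way to get an epoch loss that does not depend on how items are split is to accumulate the summed squared error and the item count, and divide once at the end. This is only a sketch under that assumption: `epoch_mse` is a hypothetical helper name, and the tensors are the ones from the question.

```python
import torch
import torch.nn as nn

def epoch_mse(batches):
    """Per-item MSE over an epoch from (prediction, target) pairs.

    Hypothetical helper for illustration, not part of PyTorch.
    """
    criterion = nn.MSELoss(reduction="sum")  # sum of squared errors, no per-batch mean
    total, count = 0.0, 0
    for pred, target in batches:
        total += criterion(pred, target).item()
        count += target.numel()
    return total / count

# Original split from the question (batch sizes 4 and 2)
split_a = [
    (torch.tensor([40., -1., -0.7, 1.]), torch.tensor([1.1, 180., 15., 1.])),
    (torch.tensor([1., 89.5]), torch.tensor([1., 14.])),
]
# Same items with one element moved from the second batch to the first
split_b = [
    (torch.tensor([40., -1., -0.7, 89.5]), torch.tensor([1.1, 180., 15., 14.])),
    (torch.tensor([1., 1.]), torch.tensor([1., 1.])),
]

loss_a = epoch_mse(split_a)
loss_b = epoch_mse(split_b)
print(loss_a, loss_b)  # both should be close to the concatenated value, about 6703.49
```

In a training loop you would probably keep the mean-reduced loss for backpropagation and only accumulate `loss.item() * batch_size` for the plot, which gives the same per-item average.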