DataLoader(shuffle=True) creates small variance in test loss

Hi Guys,

So I noticed that when I repeatedly evaluate my test set, the computed loss changes slightly after each evaluation. But only the loss; the other metrics like AUC remain the same. I also use model.eval().

My loss function looks roughly like this, where mask is a binary matrix in which 0 marks positions for which no loss should be computed. I feel like this function should not be the issue, since random batches should still yield the same average loss across the whole epoch.

import torch
import torch.nn as nn

self.criterion = nn.BCEWithLogitsLoss(reduction="none")
def missingBCEWithLogitsLoss(self, logits, target, mask):
    # element-wise BCE, zeroed out wherever mask == 0
    unreduced_loss = self.criterion(logits, target) * mask
    # average over the non-masked positions only
    reduced_loss = torch.div(torch.sum(unreduced_loss), torch.sum(mask))
    return reduced_loss
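
For context, the test loss itself is then computed by looping over the test dataloader and averaging the per-batch losses, roughly like this (a simplified sketch rather than my exact code; model and test_loader are placeholders):

model.eval()
batch_losses = []
with torch.no_grad():
    for features, target, mask in test_loader:
        logits = model(features)
        loss = self.missingBCEWithLogitsLoss(logits, target, mask)
        batch_losses.append(loss.item())
# epoch-level test loss = mean of the per-batch means
test_loss = sum(batch_losses) / len(batch_losses)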

Hi Janosch,

How large is the variation you get when computing the loss over the entire test set, i.e. is the difference small enough to be due to floating-point issues?

Are you definitely iterating over the entire test set?

Perhaps you could share the snippet of code that does the testing, by iterating over the test dataloader / dataset and whatnot.

Best,
Andrei

So the average loss sits at around 0.250,
but the variation is +/- 0.005, which is too big to come from floating-point error.

What I noticed is that when I average across only the non-masked values, the problem is present. When I instead average across all values, like below, it does not happen.

self.criterion = nn.BCEWithLogitsLoss(reduction="none")
def missingBCEWithLogitsLoss(self, logits, target, mask):
    unreduced_loss = self.criterion(logits, target) * mask
    # average over all elements, including the masked (zeroed) ones
    reduced_loss = torch.mean(unreduced_loss)
    return reduced_loss

Yes, the issue is taking the mean across only the number of non-masked values.
That can cause the per-batch loss to depend on which items end up in each batch.

So for each batch we divide the sum of the losses by the number of non-masked values.
Here is an example with two batches:

batch 1: 10 / 7    batch 2: 5 / 4

The total loss across the two batches is 10 + 5 = 15.
The total number of non-masked values is 7 + 4 = 11.
So the true loss should be 15 / 11 ≈ 1.36.

However, the loss obtained from the batched computation is the mean of the two per-batch losses:

((10 / 7) + (5 / 4)) / 2 ≈ 1.339

Now if I shuffle my batches so that the same totals are split as 6 / 5 and 9 / 6:

((6 / 5) + (9 / 6)) / 2 = 1.35

The numerators still sum to 15 and the denominators still sum to 11, but the averaged loss has changed slightly again.
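
The same arithmetic as a quick script, showing that the batch-averaged number moves with the split while the global ratio does not:

# (sum of loss, number of non-masked values) per batch
splits = {
    "original order": [(10, 7), (5, 4)],
    "shuffled order": [(6, 5), (9, 6)],
}
for name, batches in splits.items():
    batch_averaged = sum(s / n for s, n in batches) / len(batches)
    true_loss = sum(s for s, _ in batches) / sum(n for _, n in batches)
    print(name, round(batch_averaged, 4), round(true_loss, 4))
# original order: 1.3393 vs 1.3636
# shuffled order: 1.35   vs 1.3636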

Makes sense! If you replace torch.mean(unreduced_loss) with torch.sum(unreduced_loss), also keep a running total of torch.sum(mask) across batches, and divide the two totals once at the end of the evaluation, I think that will fix the issue.
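
Something along these lines, as a rough sketch (model and test_loader are placeholders):

model.eval()
loss_sum, mask_sum = 0.0, 0.0
with torch.no_grad():
    for features, target, mask in test_loader:
        logits = model(features)
        unreduced_loss = self.criterion(logits, target) * mask
        loss_sum += torch.sum(unreduced_loss).item()
        mask_sum += torch.sum(mask).item()
# divide once over the whole test set, so the result no longer depends on the batching
test_loss = loss_sum / mask_sum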