How to calculate epoch loss when using BCEWithLogitsLoss

Hello, I’m a bit confused about how to accumulate the batch losses to obtain the epoch loss.

Two questions:

  1. Is #1 (see the comments below) the correct way to calculate the loss with masks?
  2. Is #2 the correct way to report the epoch loss?
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-5)
criterion = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight)

for epoch in range(10):
    EPOCH_LOSS = 0.

    for inputs, gt_labels, masks in training_dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)

        #1: Is this the correct way to calculate the batch loss? Do I multiply batch_loss by outputs.shape[0] before adding it to EPOCH_LOSS?
        batch_loss = (masks * criterion(outputs, gt_labels.float())).mean()
        EPOCH_LOSS += batch_loss
        batch_loss.backward()
        optimizer.step()

    #2: then what do I do here? Do I divide EPOCH_LOSS by len(training_dataloader)?
    print(f'EPOCH LOSS: {EPOCH_LOSS/len(training_dataloader):.3f}')

@ptrblck @ParGG

  1. BCEWithLogitsLoss returns a float tensor with a single element unless you construct it with reduction='none'.
     Would you explain a bit more what masks does in your model?

  2. Since batch_loss is a tensor, it is recommended to use EPOCH_LOSS += batch_loss.item() instead of EPOCH_LOSS += batch_loss. .item() returns a plain Python float, so the accumulation does not keep each batch's computation graph alive.

  3. As far as I know, the length of a dataloader (a generator) is roughly round(len(dataset) / batch_size), so EPOCH_LOSS / len(dataset) would be correct (after weighting each mean batch loss by its batch size, as you suspected in #1).
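
Putting 2 and 3 together, a minimal sketch of the accumulation (assuming drop_last=False and ignoring the mask for the moment; model, criterion, and optimizer as in your snippet):

for epoch in range(10):
    epoch_loss = 0.0
    for inputs, gt_labels, masks in training_dataloader:
        optimizer.zero_grad()
        outputs = model(inputs)
        batch_loss = criterion(outputs, gt_labels.float())
        batch_loss.backward()
        optimizer.step()
        # .item() detaches the scalar, and weighting by the batch size
        # makes the final division by len(dataset) a true per-sample mean
        epoch_loss += batch_loss.item() * inputs.size(0)
    print(f'epoch loss: {epoch_loss / len(training_dataloader.dataset):.3f}')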

Thanks for your response.

  1. So my outputs shape is (14, 10, 128), where 14 is the batch_size, 10 is the seq_len, and 128 is the length of the object vector: if an element in the sequence belongs to any of the 128 objects, it is marked as 1, and 0 otherwise. The mask tells us the true length of each sequence, so its shape is (14, 10). For instance, the first sequence might only have 3 elements (so its true shape would be 3 x 128), and the rest (7 x 128) is just padding. (A toy sketch of these shapes follows below.)

  2. So basically, I should divide it by len(dataloader.dataset)?
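
For concreteness, here is a toy sketch of the shapes described in 1. (the names and the random lengths are made up purely for illustration):

import torch

B, L, D = 14, 10, 128                    # batch_size, seq_len, number of objects
outputs = torch.randn(B, L, D)           # raw logits from the model
lengths = torch.randint(1, L + 1, (B,))  # true length of each sequence
# masks[i, j] = 1.0 if position j is a real element of sequence i, 0.0 if it is padding
masks = (torch.arange(L).unsqueeze(0) < lengths.unsqueeze(1)).float()  # (B, L)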

Not always. As you can see from the implementation, the length of the dataloader depends on a few factors.

    def __len__(self) -> int:
        if self._dataset_kind == _DatasetKind.Iterable:
            # NOTE [ IterableDataset and __len__ ]
            #
            # For `IterableDataset`, `__len__` could be inaccurate when one naively
            # does multi-processing data loading, since the samples will be duplicated.
            # However, no real use case should be actually using that behavior, so
            # it should count as a user error. We should generally trust user
            # code to do the proper thing (e.g., configure each replica differently
            # in `__iter__`), and give us the correct `__len__` if they choose to
            # implement it (this will still throw if the dataset does not implement
            # a `__len__`).
            #
            # To provide a further warning, we track if `__len__` was called on the
            # `DataLoader`, save the returned value in `self._len_called`, and warn
            # if the iterator ends up yielding more than this number of samples.

            # Cannot statically verify that dataset is Sized
            length = self._IterableDataset_len_called = len(self.dataset)  # type: ignore[assignment, arg-type]
            if self.batch_size is not None:  # IterableDataset doesn't allow custom sampler or batch_sampler
                from math import ceil
                if self.drop_last:
                    length = length // self.batch_size
                else:
                    length = ceil(length / self.batch_size)
            return length
        else:
            return len(self._index_sampler)
  • If you have specified the batch_size and drop_last is true: you have to divide by len(dataloader) * batch_size.
  • If you have specified the batch_size and drop_last is false: you have to divide by len(dataset).
  • If you didn’t specify the batch_size: you have to look at the sampler or batch_sampler. (A drop_last-agnostic alternative is sketched below.)
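
If you would rather not reason about drop_last at all, a hedged alternative is to count the samples you actually iterate over (variable names are illustrative; model, criterion, and optimizer as in the question):

epoch_loss, n_seen = 0.0, 0
for inputs, gt_labels, masks in training_dataloader:
    optimizer.zero_grad()
    outputs = model(inputs)
    batch_loss = criterion(outputs, gt_labels.float())
    batch_loss.backward()
    optimizer.step()
    epoch_loss += batch_loss.item() * inputs.size(0)
    n_seen += inputs.size(0)  # correct even when the last batch is smaller
print(f'epoch loss: {epoch_loss / n_seen:.3f}')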

In that case, you have two choices:

  1. Apply masks to outputs directly before calculating the loss.
  2. Create a BCEWithLogitsLoss instance with reduction='none' and do the masking after calculating the per-element loss (see the second sketch below).

Here’s a simple strategy for choice 1: apply the mask to the outputs before computing the loss.
Let me assume gt_labels already contains only the valid targets (say, shape (3, 1)); then:

outputs = outputs.view(-1, outputs.shape[-1])           # (B, L, D) -> (B*L, D)
valid_idx = torch.argwhere(masks.view(-1)).squeeze(-1)  # indices of the valid (non-padded) positions

masked_outputs = torch.index_select(input=outputs, dim=0, index=valid_idx)  # (N_valid, D)

batch_loss = criterion(masked_outputs, gt_labels.float())
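
And a minimal sketch of choice 2, starting again from the original (B, L, D) outputs and assuming gt_labels is still padded to the full (B, L, D) shape (pos_weight as in the question):

criterion_none = torch.nn.BCEWithLogitsLoss(pos_weight=pos_weight, reduction='none')

per_elem = criterion_none(outputs, gt_labels.float())  # (B, L, D): one loss value per element
per_elem = per_elem * masks.unsqueeze(-1)              # broadcast (B, L, 1): zero out the padding

# average over the valid elements only, not over the padded ones
batch_loss = per_elem.sum() / (masks.sum() * outputs.shape[-1])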

If anyone knows a more effective way to do masking, please share your ideas.

I already replied to him about this topic in a different post: Filter Output Using Mask.


Thank you for your reply. Shouldn’t we also mask gt_labels before calculating the loss?

In his case, no, because he selects just the valid data given the mask, and his gt_labels already contain only the valid targets. If instead you computed the loss on the original (padded) outputs and labels, then you would need to mask both sides, as in the sketch below.
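
For that padded case, a hedged sketch of selecting both sides with the same indices (shapes as in the earlier example):

valid_idx = torch.argwhere(masks.view(-1)).squeeze(-1)  # (N_valid,)
flat_outputs = outputs.view(-1, outputs.shape[-1])      # (B*L, D)
flat_labels = gt_labels.view(-1, gt_labels.shape[-1])   # (B*L, D)

masked_outputs = flat_outputs[valid_idx]                # (N_valid, D)
masked_labels = flat_labels[valid_idx]                  # (N_valid, D)
batch_loss = criterion(masked_outputs, masked_labels.float())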