Why do we multiply loss function with a constant?

Why do we multiply loss with a constant sometimes? I have seen people use

loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss *= labels.size(1)

or sometimes

loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
loss *= labels.size(0)

I am not sure what the difference is between using no constant, multiplying by the number of classes, and multiplying by the batch size. Can anybody give me an explanation?

The loss is sometimes scaled by the batch size so that the epoch loss can be computed without introducing a small error due to varying batch sizes. In particular, the last batch is often smaller than the rest if the number of samples isn't evenly divisible by the batch size.
In that case, accumulating the per-batch losses via epoch_loss += loss.item() and dividing by the number of batches (len(loader)) would introduce a small error, because the smaller last batch is weighted the same as the full-sized ones.
If you instead scale each loss by its batch size before adding it to the running loss and divide by the number of samples (len(dataset)), you avoid this error.


The response is quite insightful. However, would it be possible for you to give a brief example, or point to some resource where one could explore this question? I am curious to know exactly how multiplying the loss by the batch size helps.

Thank you.

Here is a simple example you could check and play around with to see the effect of the previous description:

import torch
import torch.nn as nn

data = torch.randn(20, 1)
target = torch.randn(20, 1)
criterion = nn.MSELoss()

# compute loss of entire dataset
loss_ref = criterion(data, target)

# compute loss using mini batches
dataset = torch.utils.data.TensorDataset(data, target)
loader = torch.utils.data.DataLoader(dataset, batch_size=8)
# loader will return 3 batches with [8, 8, 4] samples

# average by number of batches
loss_avg1 = 0.
for data, target in loader:
    loss = criterion(data, target)
    loss_avg1 += loss.item()
loss_avg1 = loss_avg1 / len(loader)

# average by number of samples
loss_avg2 = 0.
for data, target in loader:
    loss = criterion(data, target)
    loss_avg2 += loss.item() * data.size(0)
loss_avg2 = loss_avg2 / len(loader.dataset)

# compare
print('reference {}, avg1 {}, avg2 {}'.format(
    loss_ref, loss_avg1, loss_avg2))

You could write down the formula used to create the final losses for all use cases and check where the error is coming from. To do so, note that the criterion calculates the mean loss by default.
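To make that concrete, here is a small numeric sketch (the batch sizes [8, 8, 4] match the loader above; the per-batch mean losses are made-up illustrative values, not computed from real data) showing why averaging the per-batch means by the number of batches differs from the sample-weighted mean:

```python
# Per-batch mean losses for three batches of sizes 8, 8, 4
# (illustrative values, not computed from real data).
batch_sizes = [8, 8, 4]
batch_means = [1.0, 2.0, 3.0]

# avg1: average of the per-batch means -> every batch is weighted equally,
# so the 4-sample batch is over-weighted.
avg1 = sum(batch_means) / len(batch_means)          # (1 + 2 + 3) / 3 = 2.0

# avg2: weight each batch mean by its batch size, then divide by the total
# number of samples -> identical to the mean over the full dataset.
total = sum(m * n for m, n in zip(batch_means, batch_sizes))
avg2 = total / sum(batch_sizes)                     # (8 + 16 + 12) / 20 = 1.8

print(avg1, avg2)  # 2.0 vs. 1.8
```

The gap between the two values comes entirely from the smaller last batch; with equal batch sizes both formulas agree.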


But we won’t backprop the multiplied loss through the training process, right? We will only backprop the loss = criterion(outputs, labels).

Yes, you won’t be able to call loss_avg1.backward() or loss_avg2.backward(), since these objects are plain Python floats and are not attached to any computation graph (loss.item() detaches the value from the computation graph and returns a Python float).
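As a small sketch of the distinction (hypothetical tensors, with a single learnable weight standing in for a full model):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
weight = torch.randn(1, requires_grad=True)  # stand-in for model parameters
data = torch.randn(4, 1)
target = torch.randn(4, 1)

criterion = nn.MSELoss()
loss = criterion(data * weight, target)

# The loss tensor is attached to the graph, so scaling it and calling
# backward() still works (the gradients are scaled by the same constant):
(loss * data.size(0)).backward()
print(weight.grad)  # a valid gradient tensor

# loss.item() returns a plain Python float with no graph attached,
# so there is nothing to call backward() on:
detached = loss.item()
print(type(detached))  # <class 'float'>
```

In other words, the scaling shown earlier is only used for loss *reporting*; the tensor passed to backward() during training is unaffected by how you accumulate the running epoch loss.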