Cross Entropy Loss vs Batch Size

Is it normal for the cross entropy loss to increase when I increase the batch size?
I have the following loss:

from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

I am comparing the batch size of 32 using two methods:
1- Using device batch size=32
2- Using device batch size=2 with gradient accumulation step=16

For the first approach, the loss starts from 0.6599792838096619, while for the second approach it starts from 0.0303945392370224. It looks like it is scaled by the batch size. Do I need to divide it by the batch size again before calling backward, or is it correct as is?

Hello Maral!

By default, CrossEntropyLoss does not increase with batch size.

From its documentation, unless you explicitly construct it with
reduction = 'sum', it will default to reduction = 'mean',
for which “the sum of the output will be divided by the number
of elements in the output.”
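
In other words, only reduction = 'sum' grows with the number of
samples; the default 'mean' averages it away. Here is a minimal
sketch of that difference (using a recent pytorch version; the
tensors and numbers are purely illustrative):

import torch

logits = torch.randn (4, 5)          # batch of 4 samples, 5 classes
labels = torch.randint (0, 5, (4,))

loss_mean = torch.nn.CrossEntropyLoss (reduction = 'mean') (logits, labels)
loss_sum = torch.nn.CrossEntropyLoss (reduction = 'sum') (logits, labels)

# 'sum' scales with the batch: loss_sum equals loss_mean * 4 here
print (loss_mean.item(), loss_sum.item(), (loss_sum / 4).item())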

Here is a short (pytorch version 0.3.0) script that illustrates this:

import torch
torch.__version__

torch.manual_seed (2020)

# a batch of 2 predictions (5 classes) and 2 matching targets
pred2 = torch.autograd.Variable (torch.randn (2, 5))
targ2 = torch.autograd.Variable (torch.LongTensor (2).random_ (0, 5))

pred2
targ2

# repeat the same samples 16 times to get a batch of 32
pred32 = pred2.repeat (16, 1)
targ32 = targ2.repeat (16)

pred32.shape
targ32.shape

loss_fn = torch.nn.CrossEntropyLoss()

# the default 'mean' reduction gives the same loss for both batch sizes
loss_fn (pred2, targ2)
loss_fn (pred32, targ32)

And here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> torch.manual_seed (2020)
<torch._C.Generator object at 0x0000020A70856630>
>>>
>>> pred2 = torch.autograd.Variable (torch.randn (2, 5))
>>> targ2 = torch.autograd.Variable (torch.LongTensor (2).random_ (0, 5))
>>>
>>> pred2
Variable containing:
 1.2372 -0.9604  1.5415 -0.4079  0.8806
 0.0529  0.0751  0.4777 -0.6759 -2.1489
[torch.FloatTensor of size 2x5]

>>> targ2
Variable containing:
 0
 1
[torch.LongTensor of size 2]

>>>
>>> pred32 = pred2.repeat (16, 1)
>>> targ32 = targ2.repeat (16)
>>>
>>> pred32.shape
torch.Size([32, 5])
>>> targ32.shape
torch.Size([32])
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss()
>>>
>>> loss_fn (pred2, targ2)
Variable containing:
 1.3058
[torch.FloatTensor of size 1]

>>> loss_fn (pred32, targ32)
Variable containing:
 1.3058
[torch.FloatTensor of size 1]

As noted above, the larger loss is not coming from the larger batch
size fed to CrossEntropyLoss. Without seeing your actual code,
especially how you implement

gradient accumulation step=16,

it’s hard to guess what might be going on.
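
One common pattern, for example, divides the per-step loss by the
number of accumulation steps before calling backward() (so that the
accumulated gradients match the large-batch gradients); if that scaled
loss is also what gets logged, the reported value will look roughly 16
times smaller. A rough sketch of that pattern, purely as a guess (the
model and names here are placeholders, not your actual code):

import torch

accumulation_steps = 16
model = torch.nn.Linear (10, 5)        # stand-in model with 5 labels
loss_fct = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD (model.parameters(), lr = 0.1)

optimizer.zero_grad()
for step in range (accumulation_steps):
    logits = model (torch.randn (2, 10))     # device batch size = 2
    labels = torch.randint (0, 5, (2,))
    # scale the loss so the accumulated gradients match a batch of 32
    loss = loss_fct (logits, labels) / accumulation_steps
    loss.backward()                          # gradients add up across steps
    # logging this scaled loss makes it appear ~16x smaller
optimizer.step()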

Best.

K. Frank

Thanks! I was expecting what you said, but I still see the scaled loss for some reason. Accuracy is normal, but the loss scale is different for approach 1 and approach 2!