Cross Entropy Loss vs Batch Size

Is it normal for the cross entropy loss to increase when I increase the batch size?
I have the following loss:

from torch.nn import CrossEntropyLoss

loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

I am comparing the batch size of 32 using two methods:
1- Using device batch size=32
2- Using device batch size=2 with gradient accumulation step=16

For the first approach, the loss starts from 0.6599792838096619, while for the second approach it starts from 0.0303945392370224. It looks like it is scaled by the batch size. Do I need to divide it by the batch size again before calling backward, or is it correct as is?

Hello Maral!

By default, CrossEntropyLoss does not increase with batch size.

From its documentation, unless you explicitly construct it with
reduction = 'sum', it will default to reduction = 'mean',
for which “the sum of the output will be divided by the number
of elements in the output.”
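
In other words, only reduction = 'sum' grows with the number of
samples; the default 'mean' averages it away. Here is a minimal
sketch of that difference (using a recent pytorch version; the
tensors and numbers are purely illustrative):

import torch

logits = torch.randn (4, 5)          # batch of 4 samples, 5 classes
labels = torch.randint (0, 5, (4,))

loss_mean = torch.nn.CrossEntropyLoss (reduction = 'mean') (logits, labels)
loss_sum = torch.nn.CrossEntropyLoss (reduction = 'sum') (logits, labels)

# 'sum' scales with the batch: loss_sum equals loss_mean * 4 here
print (loss_mean.item(), loss_sum.item(), (loss_sum / 4).item())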

Here is a short (pytorch version 0.3.0) script that illustrates this:

import torch
torch.__version__

torch.manual_seed (2020)

# a batch of 2 predictions (5 classes) and 2 matching targets
pred2 = torch.autograd.Variable (torch.randn (2, 5))
targ2 = torch.autograd.Variable (torch.LongTensor (2).random_ (0, 5))

pred2
targ2

# repeat the same samples 16 times to get a batch of 32
pred32 = pred2.repeat (16, 1)
targ32 = targ2.repeat (16)

pred32.shape
targ32.shape

loss_fn = torch.nn.CrossEntropyLoss()

# the default 'mean' reduction gives the same loss for both batch sizes
loss_fn (pred2, targ2)
loss_fn (pred32, targ32)

And here is the output:

>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> torch.manual_seed (2020)
<torch._C.Generator object at 0x0000020A70856630>
>>>
>>> pred2 = torch.autograd.Variable (torch.randn (2, 5))
>>> targ2 = torch.autograd.Variable (torch.LongTensor (2).random_ (0, 5))
>>>
>>> pred2
Variable containing:
 1.2372 -0.9604  1.5415 -0.4079  0.8806
 0.0529  0.0751  0.4777 -0.6759 -2.1489
[torch.FloatTensor of size 2x5]

>>> targ2
Variable containing:
 0
 1
[torch.LongTensor of size 2]

>>>
>>> pred32 = pred2.repeat (16, 1)
>>> targ32 = targ2.repeat (16)
>>>
>>> pred32.shape
torch.Size([32, 5])
>>> targ32.shape
torch.Size([32])
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss()
>>>
>>> loss_fn (pred2, targ2)
Variable containing:
 1.3058
[torch.FloatTensor of size 1]

>>> loss_fn (pred32, targ32)
Variable containing:
 1.3058
[torch.FloatTensor of size 1]

As noted above, the larger loss is not coming from the larger batch
size fed to CrossEntropyLoss. Without seeing your actual code,
especially how you implement

gradient accumulation step=16,

it’s hard to guess what might be going on.
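
One common pattern, for example, divides the per-step loss by the
number of accumulation steps before calling backward() (so that the
accumulated gradients match the large-batch gradients); if that scaled
loss is also what gets logged, the reported value will look roughly 16
times smaller. A rough sketch of that pattern, purely as a guess (the
model and names here are placeholders, not your actual code):

import torch

accumulation_steps = 16
model = torch.nn.Linear (10, 5)        # stand-in model with 5 labels
loss_fct = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD (model.parameters(), lr = 0.1)

optimizer.zero_grad()
for step in range (accumulation_steps):
    logits = model (torch.randn (2, 10))     # device batch size = 2
    labels = torch.randint (0, 5, (2,))
    # scale the loss so the accumulated gradients match a batch of 32
    loss = loss_fct (logits, labels) / accumulation_steps
    loss.backward()                          # gradients add up across steps
    # logging this scaled loss makes it appear ~16x smaller
optimizer.step()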

Best.

K. Frank

Thanks! I was expecting what you said, but I still see the scaled loss for some reason. Accuracy is normal, but the loss scale is different for approach 1 and approach 2!