# Cross Entropy Loss vs Batch Size

Is it normal for the cross entropy loss to increase when the batch size increases?
I have the following loss:

```
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
```

I am comparing an effective batch size of 32 using two methods:
1- Device batch size=32
2- Device batch size=2 with gradient accumulation steps=16

For the first approach, the loss starts from 0.6599792838096619, while for the second approach it starts from 0.0303945392370224. It looks like the loss is scaled by the batch size. Do I need to divide it by the batch size again before calling backward, or is it correct as is?

Hello Maral!

By default, `CrossEntropyLoss` does not increase with batch size.

From its documentation, unless you explicitly construct it with
`reduction = 'sum'`, it will default to `reduction = 'mean'`,
for which “the sum of the output will be divided by the number
of elements in the output.”

Here is a short script (PyTorch version 0.3.0) that illustrates this:

```
import torch
torch.__version__

torch.manual_seed (2020)

# a batch of 2 predictions over 5 classes, plus 2 random target class indices
pred2 = torch.autograd.Variable (torch.randn (2, 5))
targ2 = torch.autograd.Variable (torch.LongTensor (2).random_ (0, 5))

pred2
targ2

# repeat the same 2 samples 16 times to build a batch of 32
pred32 = pred2.repeat (16, 1)
targ32 = targ2.repeat (16)

pred32.shape
targ32.shape

# default reduction = 'mean', so both batch sizes give the same loss
loss_fn = torch.nn.CrossEntropyLoss()

loss_fn (pred2, targ2)
loss_fn (pred32, targ32)
```

And here is the output:

```
>>> import torch
>>> torch.__version__
'0.3.0b0+591e73e'
>>>
>>> torch.manual_seed (2020)
<torch._C.Generator object at 0x0000020A70856630>
>>>
>>> pred2 = torch.autograd.Variable (torch.randn (2, 5))
>>> targ2 = torch.autograd.Variable (torch.LongTensor (2).random_ (0, 5))
>>>
>>> pred2
Variable containing:
1.2372 -0.9604  1.5415 -0.4079  0.8806
0.0529  0.0751  0.4777 -0.6759 -2.1489
[torch.FloatTensor of size 2x5]

>>> targ2
Variable containing:
0
1
[torch.LongTensor of size 2]

>>>
>>> pred32 = pred2.repeat (16, 1)
>>> targ32 = targ2.repeat (16)
>>>
>>> pred32.shape
torch.Size([32, 5])
>>> targ32.shape
torch.Size([32])
>>>
>>> loss_fn = torch.nn.CrossEntropyLoss()
>>>
>>> loss_fn (pred2, targ2)
Variable containing:
1.3058
[torch.FloatTensor of size 1]

>>> loss_fn (pred32, targ32)
Variable containing:
1.3058
[torch.FloatTensor of size 1]
```
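
For contrast, if you had constructed the loss with `reduction = 'sum'`, it would indeed scale with the batch size; with the default `reduction = 'mean'` it does not. Here is a quick sketch of that difference, written against a recent PyTorch API rather than the 0.3.0 session above:

```
import torch

torch.manual_seed(2020)

pred2 = torch.randn(2, 5)           # batch of 2 predictions over 5 classes
targ2 = torch.randint(0, 5, (2,))   # 2 target class indices

pred32 = pred2.repeat(16, 1)        # same samples repeated to a batch of 32
targ32 = targ2.repeat(16)

mean_fn = torch.nn.CrossEntropyLoss()                   # default: reduction = 'mean'
sum_fn = torch.nn.CrossEntropyLoss(reduction='sum')

print(mean_fn(pred2, targ2), mean_fn(pred32, targ32))   # identical values
print(sum_fn(pred2, targ2), sum_fn(pred32, targ32))     # second is 16x the first
```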

As noted above, the larger loss is not coming from the larger batch
size fed to `CrossEntropyLoss`. Without seeing your actual code,
especially how you implement the gradient accumulation, it is hard to
say exactly where the difference between the two runs comes from.
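
That said, one common source of a smaller reported loss with gradient accumulation is the loop itself: many implementations divide the micro-batch loss by the number of accumulation steps before calling `backward()` (so that the accumulated gradient matches a single large-batch step), and if that scaled value is also the one being logged, the reported loss shrinks by roughly that factor. Here is a minimal, self-contained sketch of that pattern, with a dummy model and random data standing in for your setup:

```
import torch
from torch.nn import CrossEntropyLoss, Linear

num_labels, accum_steps = 5, 16
model = Linear(10, num_labels)                       # stand-in for the real model
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fct = CrossEntropyLoss()                        # default reduction = 'mean'

# 32 dummy samples split into 16 micro-batches of 2 (device batch size = 2)
inputs = torch.randn(32, 10)
labels = torch.randint(0, num_labels, (32,))

optimizer.zero_grad()
for step in range(accum_steps):
    x = inputs[step * 2:(step + 1) * 2]
    y = labels[step * 2:(step + 1) * 2]
    logits = model(x)
    loss = loss_fct(logits.view(-1, num_labels), y.view(-1))

    # Divide so the accumulated gradient matches one batch-size-32 step.
    (loss / accum_steps).backward()

    # Log loss.item(), not (loss / accum_steps).item(); logging the scaled
    # value makes the reported loss ~16x smaller than the batch-size-32 run.

optimizer.step()
optimizer.zero_grad()
```

If the value you quoted for the accumulation run is the already-divided loss, that alone would explain most of the gap, without any change to the loss function itself.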