How is MSELoss() implemented?

I’m trying to understand how MSELoss() is implemented. People usually assume MSELoss is ((input - target)**2).sum() / batch_size, but when I explicitly write this as the loss function, it leads to a very different training curve than nn.MSELoss() does.


nn.MSELoss() is implemented by default as: ((input-target)**2).mean()
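A quick numerical check of this (a minimal sketch; the 4×3 shape is arbitrary):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)   # predictions
t = torch.randn(4, 3)   # targets

loss_fn = nn.MSELoss()
# both lines print the same value: the squared error averaged
# over *all* elements of the tensor
print(loss_fn(x, t))
print(((x - t) ** 2).mean())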


Thanks. What’s the difference between this and (input-target).pow(2).sum()/input.size()[0]?

((input-target)**2).mean() is equivalent to

(input - target).pow(2).sum() / input.numel()

Note that input.numel() (the total number of elements) equals input.size()[0] only when input is a 1D tensor.
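To make the difference concrete (a small sketch; the 8×5 shape is arbitrary):

import torch

torch.manual_seed(0)
inp = torch.randn(8, 5)   # batch of 8 samples, 5 values each
tgt = torch.randn(8, 5)

per_element = (inp - tgt).pow(2).sum() / inp.numel()   # numel() == 40; same as .mean()
per_sample = (inp - tgt).pow(2).sum() / inp.size(0)    # size(0) == 8, the batch size
print(per_element, per_sample)   # per_sample is 5x per_element here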


Hi,

I tried the following loss functions

output = model(data)
train_loss1 = F.mse_loss(output, target.cuda(), True)
train_loss2 = ((output - target.cuda())**2).mean()

and got different results. Also, the training curve is different.

How is that possible, please?


If training is different, the computation of gradients is probably different, assuming you’re using the same optimizer.

Can you check whether your gradients (or the outputs) are different? Are there any random parts to your model (random initialization, etc.)? Weights of nn layers are usually randomly initialized.
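For example, one way to rule out random initialization as the culprit is to seed the RNGs before constructing the model (a minimal sketch; MyModel is a hypothetical placeholder):

import torch

torch.manual_seed(42)            # seeds the CPU RNG
torch.cuda.manual_seed_all(42)   # seeds all GPU RNGs (no-op without CUDA)
# construct the model *after* seeding so layer weights initialize identically
# model = MyModel()   # hypothetical model class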

Training is different, but that actually doesn’t matter.

What matters is that train_loss1 and train_loss2 are different numbers, though I would expect them to be equal. How can this happen?

@rasemailcz could you show me some inputs where train_loss1 and train_loss2 are different?

@richard

output = -0.0226 -0.1922 0.1461 -0.0481 -0.0526 -0.0449 0.1586 -0.0980 0.2177 0.0402 -0.1200 0.0010 0.0022 0.0430 0.1930 0.2304 0.0043 0.0659 0.2427 0.1580 -0.0449 -0.0477 0.1961 0.2336 -0.0308 0.0648 0.0669 0.0072 0.0353 -0.2793 0.0105 -0.1510 -0.0942 -0.1761 0.0477 -0.0564 -0.1628 0.0467 -0.0819 -0.2643 0.1066 -0.0952 0.0918 -0.0934 0.1405 -0.1959 -0.0477 -0.1138 -0.1032 -0.0622 -0.0658 0.2957 -0.1170 -0.1541 0.1663 0.2635 -0.1477 0.2634 0.0940 -0.0477 -0.1920 0.0104 0.3450 -0.0514 -0.1592 0.2188 -0.3998 -0.1696 -0.1194 -0.3216 -0.0702 0.0074 -0.0223 0.0597 0.0329 -0.1500 0.2207 0.1900 -0.0688 0.3510 0.1114 -0.0829 0.0919 -0.1787 0.0266 -0.2059 0.0821 -0.1061 0.1190 0.0090

target.cuda() = 0.3100 0.0638 -0.1865 -0.4252 -0.6371 -0.8090 -0.9300 -0.9922 -0.9919 -0.9289 -0.8072 -0.6343 -0.4212 -0.1814 0.0700 0.3170 0.5440 0.7365 0.8823 0.9720 0.9999 0.9641 0.8669 0.7143 0.5162 0.2850 0.0355 -0.2163 -0.4543 -0.6634 -0.8300 -0.9435 -0.9965 -0.9855 -0.9113 -0.7784 -0.5955 -0.3742 -0.1287 0.1252 0.3711 0.5931 0.7769 0.9105 0.9854 0.9965 0.9431 0.8286 0.6604 0.4493 0.2089 -0.0451 -0.2962 -0.5282 -0.7259 -0.8763 -0.9697 -1.0000 -0.9649 -0.8669 -0.7122 -0.5110 -0.2762 -0.0232 0.2313 0.4707 0.6794 0.8435 0.9523 0.9986 0.9792 0.8954 0.7526 0.5602 0.3308 0.0794 -0.1772 -0.4222 -0.6394 -0.8143 -0.9355 -0.9947 -0.9880 -0.9158 -0.7828 -0.5979 -0.3731 -0.1234 0.1345 0.3836

The following script:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

output = Variable(torch.Tensor(
[-0.0226, -0.1922, 0.1461, -0.0481, -0.0526, -0.0449, 0.1586, -0.0980, 0.2177, 
0.0402, -0.1200, 0.0010, 0.0022, 0.0430, 0.1930, 0.2304, 0.0043, 0.0659, 0.2427, 
0.1580, -0.0449, -0.0477, 0.1961, 0.2336, -0.0308, 0.0648, 0.0669, 0.0072, 0.0353,
-0.2793, 0.0105, -0.1510, -0.0942, -0.1761, 0.0477, -0.0564, -0.1628, 0.0467, 
-0.0819, -0.2643, 0.1066, -0.0952, 0.0918, -0.0934, 0.1405, -0.1959, -0.0477,
-0.1138, -0.1032, -0.0622, -0.0658, 0.2957, -0.1170, -0.1541, 0.1663, 0.2635,
-0.1477, 0.2634, 0.0940, -0.0477, -0.1920, 0.0104, 0.3450, -0.0514, -0.1592, 0.2188,
-0.3998, -0.1696, -0.1194, -0.3216, -0.0702, 0.0074, -0.0223, 0.0597, 0.0329, 
-0.1500, 0.2207, 0.1900, -0.0688, 0.3510, 0.1114, -0.0829, 0.0919, -0.1787, 0.0266, 
-0.2059, 0.0821, -0.1061, 0.1190, 0.0090]), requires_grad=True)

target = Variable(torch.Tensor(
[0.3100, 0.0638, -0.1865, -0.4252, -0.6371, -0.8090, -0.9300, -0.9922, -0.9919, 
-0.9289, -0.8072, -0.6343, -0.4212, -0.1814, 0.0700, 0.3170, 0.5440, 0.7365, 0.8823,
0.9720, 0.9999, 0.9641, 0.8669, 0.7143, 0.5162, 0.2850, 0.0355, -0.2163, -0.4543, 
-0.6634, -0.8300, -0.9435, -0.9965, -0.9855, -0.9113, -0.7784, -0.5955, -0.3742,
-0.1287, 0.1252, 0.3711, 0.5931, 0.7769, 0.9105, 0.9854, 0.9965, 0.9431, 0.8286, 
0.6604, 0.4493, 0.2089, -0.0451, -0.2962, -0.5282, -0.7259, -0.8763, -0.9697, 
-1.0000, -0.9649, -0.8669, -0.7122, -0.5110, -0.2762, -0.0232, 0.2313, 0.4707, 
0.6794, 0.8435, 0.9523, 0.9986, 0.9792, 0.8954, 0.7526, 0.5602, 0.3308, 0.0794,
-0.1772, -0.4222, -0.6394, -0.8143, -0.9355, -0.9947, -0.9880, -0.9158, -0.7828,
-0.5979, -0.3731, -0.1234, 0.1345, 0.3836]))

train_loss1 = F.mse_loss(output, target, True)
train_loss2 = ((output - target)**2).mean()

gives

In [8]: train_loss1
Out[8]:
Variable containing:
 0.5171
[torch.FloatTensor of size (1,)]

In [9]: train_loss2
Out[9]:
Variable containing:
 0.5171
[torch.FloatTensor of size (1,)]

I’d also expect it to give the same result on the GPU. Does your problem occur when your inputs are on the CPU?

@richard

I found the reason, but I must have missed it in the documentation. Can you explain to me (or point me to the documentation) why the outputs of

train_loss1 = F.mse_loss(output, target, True)
train_loss2 = ((output - target)**2).mean()

are different when the sizes of the tensors are:

output.data.size()
Out[9]: torch.Size([90, 1, 1, 1])

and

target.data.size()
Out[8]: torch.Size([90, 1])

?

When the output size is changed to [90, 1], the results are the same…

That appears to be a bug, thank you for pointing it out. I’ve opened an issue here: https://github.com/pytorch/pytorch/issues/4938.

Relevant reading: the broadcasting semantics docs; something seems off about how mse_loss handles these shapes.
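For reference, here is what broadcasting does to the manual expression with those shapes (a minimal sketch with random data):

import torch

out = torch.randn(90, 1, 1, 1)
tgt = torch.randn(90, 1)

diff = out - tgt
# the mismatched shapes broadcast instead of lining up elementwise:
print(diff.shape)   # torch.Size([90, 1, 90, 1])

wrong = (diff ** 2).mean()   # mean over all 90*90 broadcast pairs
right = ((out.view(-1) - tgt.view(-1)) ** 2).mean()   # intended elementwise MSE
print(wrong, right)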


@rasemailcz make sure your output and target have exactly the same dimensions. For example, if your output has size [32] but your target has size [32, 1], the calculated loss will be wrong. You can fix this with output.view(target.size()).
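A sketch of that pitfall and the fix, using the shapes from the example above:

import torch

output = torch.randn(32)      # shape [32]
target = torch.randn(32, 1)   # shape [32, 1]

# output - target broadcasts to shape [32, 32]: every output element is
# compared against every target element, which is not the intended loss
loss_wrong = ((output - target) ** 2).mean()

# reshaping output to match target restores the elementwise comparison
loss_right = ((output.view(target.size()) - target) ** 2).mean()
print(loss_wrong, loss_right)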

@richard That does not make sense to me, since input.size()[0] is just the batch size, so you are dividing the total loss by the batch size. What’s wrong with that?

loss = nn.MSELoss()
out = loss(x, t)

divides by the total number of elements in your tensor, which is different from the batch size.

Thanks. I think people usually divide only by the batch size?

My above explanation was for how nn.MSELoss is implemented. If you want to divide only by the batch size, you can do the following:

loss = nn.MSELoss(reduce=False)
out = loss(x, t).sum() / batch_size
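Note that reduce=False is the old API; newer PyTorch versions spell the same idea with reduction='none' (a sketch):

import torch
import torch.nn as nn

x = torch.randn(32, 10)   # batch_size == 32
t = torch.randn(32, 10)

loss_fn = nn.MSELoss(reduction='none')   # keep the per-element losses
out = loss_fn(x, t).sum() / x.size(0)    # sum everything, divide by batch size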

It’s divided by the total number of elements in the input, i.e. n*c*h*w for a 4D input, not the batch size.


@richard
do you have any comment on this related post?