How is MSELoss() implemented?

I’m trying to understand how MSELoss() is implemented. People usually assume MSELoss is ((input - target)**2).sum() / batch_size, but when I explicitly write this as the loss function, it leads to a very different training curve than nn.MSELoss() does.


nn.MSELoss() is implemented by default as: ((input-target)**2).mean()
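A quick numerical check of this (a minimal sketch; the 4×3 shape is arbitrary):

import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(4, 3)   # predictions
t = torch.randn(4, 3)   # targets

loss_fn = nn.MSELoss()
# both lines print the same value: the squared error averaged
# over *all* elements of the tensor
print(loss_fn(x, t))
print(((x - t) ** 2).mean())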


Thanks. What’s the difference between this and (input-target).pow(2).sum()/input.size()[0]?

((input-target)**2).mean() is equivalent to

(input - target).pow(2).sum() / input.numel()

Note that input.numel() (the total number of elements) equals input.size()[0] only when input is a 1D tensor.
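To make the difference concrete (a small sketch; the 8×5 shape is arbitrary):

import torch

torch.manual_seed(0)
inp = torch.randn(8, 5)   # batch of 8 samples, 5 values each
tgt = torch.randn(8, 5)

per_element = (inp - tgt).pow(2).sum() / inp.numel()   # numel() == 40; same as .mean()
per_sample = (inp - tgt).pow(2).sum() / inp.size(0)    # size(0) == 8, the batch size
print(per_element, per_sample)   # per_sample is 5x per_element here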


Hi,

I tried the following loss functions

output = model(data)
train_loss1 = F.mse_loss(output, target.cuda(), True)
train_loss2 = ((output - target.cuda())**2).mean()

and got different results. Also, the training curve is different.

How is that possible, please?


If training is different, the computation of gradients is probably different, assuming you’re using the same optimizer.

Can you check whether your gradients (or the outputs) are different? Are there any random parts to your model (random initialization, etc.)? Weights of nn layers are usually randomly initialized.
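For example, one way to rule out random initialization as the culprit is to seed the RNGs before constructing the model (a minimal sketch; MyModel is a hypothetical placeholder):

import torch

torch.manual_seed(42)            # seeds the CPU RNG
torch.cuda.manual_seed_all(42)   # seeds all GPU RNGs (no-op without CUDA)
# construct the model *after* seeding so layer weights initialize identically
# model = MyModel()   # hypothetical model class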

Training is different, but that actually doesn’t matter.

What matters is that train_loss1 and train_loss2 are different numbers, though I would expect them to be equal. How can this happen?

@rasemailcz could you show me some inputs where train_loss1 and train_loss2 are different?

@richard

output = -0.0226 -0.1922 0.1461 -0.0481 -0.0526 -0.0449 0.1586 -0.0980 0.2177 0.0402 -0.1200 0.0010 0.0022 0.0430 0.1930 0.2304 0.0043 0.0659 0.2427 0.1580 -0.0449 -0.0477 0.1961 0.2336 -0.0308 0.0648 0.0669 0.0072 0.0353 -0.2793 0.0105 -0.1510 -0.0942 -0.1761 0.0477 -0.0564 -0.1628 0.0467 -0.0819 -0.2643 0.1066 -0.0952 0.0918 -0.0934 0.1405 -0.1959 -0.0477 -0.1138 -0.1032 -0.0622 -0.0658 0.2957 -0.1170 -0.1541 0.1663 0.2635 -0.1477 0.2634 0.0940 -0.0477 -0.1920 0.0104 0.3450 -0.0514 -0.1592 0.2188 -0.3998 -0.1696 -0.1194 -0.3216 -0.0702 0.0074 -0.0223 0.0597 0.0329 -0.1500 0.2207 0.1900 -0.0688 0.3510 0.1114 -0.0829 0.0919 -0.1787 0.0266 -0.2059 0.0821 -0.1061 0.1190 0.0090

target.cuda() = 0.3100 0.0638 -0.1865 -0.4252 -0.6371 -0.8090 -0.9300 -0.9922 -0.9919 -0.9289 -0.8072 -0.6343 -0.4212 -0.1814 0.0700 0.3170 0.5440 0.7365 0.8823 0.9720 0.9999 0.9641 0.8669 0.7143 0.5162 0.2850 0.0355 -0.2163 -0.4543 -0.6634 -0.8300 -0.9435 -0.9965 -0.9855 -0.9113 -0.7784 -0.5955 -0.3742 -0.1287 0.1252 0.3711 0.5931 0.7769 0.9105 0.9854 0.9965 0.9431 0.8286 0.6604 0.4493 0.2089 -0.0451 -0.2962 -0.5282 -0.7259 -0.8763 -0.9697 -1.0000 -0.9649 -0.8669 -0.7122 -0.5110 -0.2762 -0.0232 0.2313 0.4707 0.6794 0.8435 0.9523 0.9986 0.9792 0.8954 0.7526 0.5602 0.3308 0.0794 -0.1772 -0.4222 -0.6394 -0.8143 -0.9355 -0.9947 -0.9880 -0.9158 -0.7828 -0.5979 -0.3731 -0.1234 0.1345 0.3836

The following script:

import torch
import torch.nn.functional as F
from torch.autograd import Variable

output = Variable(torch.Tensor(
[-0.0226, -0.1922, 0.1461, -0.0481, -0.0526, -0.0449, 0.1586, -0.0980, 0.2177, 
0.0402, -0.1200, 0.0010, 0.0022, 0.0430, 0.1930, 0.2304, 0.0043, 0.0659, 0.2427, 
0.1580, -0.0449, -0.0477, 0.1961, 0.2336, -0.0308, 0.0648, 0.0669, 0.0072, 0.0353,
-0.2793, 0.0105, -0.1510, -0.0942, -0.1761, 0.0477, -0.0564, -0.1628, 0.0467, 
-0.0819, -0.2643, 0.1066, -0.0952, 0.0918, -0.0934, 0.1405, -0.1959, -0.0477,
-0.1138, -0.1032, -0.0622, -0.0658, 0.2957, -0.1170, -0.1541, 0.1663, 0.2635,
-0.1477, 0.2634, 0.0940, -0.0477, -0.1920, 0.0104, 0.3450, -0.0514, -0.1592, 0.2188,
-0.3998, -0.1696, -0.1194, -0.3216, -0.0702, 0.0074, -0.0223, 0.0597, 0.0329, 
-0.1500, 0.2207, 0.1900, -0.0688, 0.3510, 0.1114, -0.0829, 0.0919, -0.1787, 0.0266, 
-0.2059, 0.0821, -0.1061, 0.1190, 0.0090]), requires_grad=True)

target = Variable(torch.Tensor(
[0.3100, 0.0638, -0.1865, -0.4252, -0.6371, -0.8090, -0.9300, -0.9922, -0.9919, 
-0.9289, -0.8072, -0.6343, -0.4212, -0.1814, 0.0700, 0.3170, 0.5440, 0.7365, 0.8823,
0.9720, 0.9999, 0.9641, 0.8669, 0.7143, 0.5162, 0.2850, 0.0355, -0.2163, -0.4543, 
-0.6634, -0.8300, -0.9435, -0.9965, -0.9855, -0.9113, -0.7784, -0.5955, -0.3742,
-0.1287, 0.1252, 0.3711, 0.5931, 0.7769, 0.9105, 0.9854, 0.9965, 0.9431, 0.8286, 
0.6604, 0.4493, 0.2089, -0.0451, -0.2962, -0.5282, -0.7259, -0.8763, -0.9697, 
-1.0000, -0.9649, -0.8669, -0.7122, -0.5110, -0.2762, -0.0232, 0.2313, 0.4707, 
0.6794, 0.8435, 0.9523, 0.9986, 0.9792, 0.8954, 0.7526, 0.5602, 0.3308, 0.0794,
-0.1772, -0.4222, -0.6394, -0.8143, -0.9355, -0.9947, -0.9880, -0.9158, -0.7828,
-0.5979, -0.3731, -0.1234, 0.1345, 0.3836]))

train_loss1 = F.mse_loss(output, target, True)
train_loss2 = ((output - target)**2).mean()

gives

In [8]: train_loss1
Out[8]:
Variable containing:
 0.5171
[torch.FloatTensor of size (1,)]

In [9]: train_loss2
Out[9]:
Variable containing:
 0.5171
[torch.FloatTensor of size (1,)]

I’d also expect it to give the same result on the GPU. Does your problem occur when your inputs are on the CPU?

@richard

I found the reason, but I must have missed it in the documentation. Can you explain to me (or point me to the documentation) why the outputs of

train_loss1 = F.mse_loss(output, target, True)
train_loss2 = ((output - target)**2).mean()

are different when the sizes of the tensors are:

output.data.size()
Out[9]: torch.Size([90, 1, 1, 1])

and

target.data.size()
Out[8]: torch.Size([90, 1])

?

When the output size is changed to [90, 1], the results are the same…

That appears to be a bug, thank you for pointing it out. I’ve opened an issue here: https://github.com/pytorch/pytorch/issues/4938.

Relevant reading: the broadcasting semantics docs; something seems off about how mse_loss handles these shapes.
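For reference, here is what broadcasting does to the manual expression with those shapes (a minimal sketch with random data):

import torch

out = torch.randn(90, 1, 1, 1)
tgt = torch.randn(90, 1)

diff = out - tgt
# the mismatched shapes broadcast instead of lining up elementwise:
print(diff.shape)   # torch.Size([90, 1, 90, 1])

wrong = (diff ** 2).mean()   # mean over all 90*90 broadcast pairs
right = ((out.view(-1) - tgt.view(-1)) ** 2).mean()   # intended elementwise MSE
print(wrong, right)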


@rasemailcz make sure your output and target have exactly the same dimensions. For example, if your output has size [32] but your target has size [32, 1], the calculated loss will be wrong. You can fix this with output.view(target.size()).
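A sketch of that pitfall and the fix, using the shapes from the example above:

import torch

output = torch.randn(32)      # shape [32]
target = torch.randn(32, 1)   # shape [32, 1]

# output - target broadcasts to shape [32, 32]: every output element is
# compared against every target element, which is not the intended loss
loss_wrong = ((output - target) ** 2).mean()

# reshaping output to match target restores the elementwise comparison
loss_right = ((output.view(target.size()) - target) ** 2).mean()
print(loss_wrong, loss_right)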

@richard That does not make sense to me, since input.size()[0] is just the batch size, so you are dividing the total loss by the batch size. What’s wrong with that?

loss = nn.MSELoss()
out = loss(x, t)

divides by the total number of elements in your tensor, which is different from the batch size.

Thanks. I think people usually divide only by the batch size?

My above explanation was for how nn.MSELoss is implemented. If you want to divide only by the batch size, you can do the following:

loss = nn.MSELoss(reduce=False)
out = loss(x, t).sum() / batch_size
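Note that reduce=False is the old API; newer PyTorch versions spell the same idea with reduction='none' (a sketch):

import torch
import torch.nn as nn

x = torch.randn(32, 10)   # batch_size == 32
t = torch.randn(32, 10)

loss_fn = nn.MSELoss(reduction='none')   # keep the per-element losses
out = loss_fn(x, t).sum() / x.size(0)    # sum everything, divide by batch size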

It’s divided by the total number of elements in the input, i.e. n*c*h*w for a 4D input, not the batch size.


@richard
do you have any comment on this related post?