I’m trying to understand how nn.MSELoss() is implemented. Usually people think MSE loss is ((input-target)**2).sum()/batch_size, but when I explicitly write this as the loss function, it leads to a very different training curve than nn.MSELoss().
nn.MSELoss() is implemented by default as: ((input-target)**2).mean()
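This equivalence is easy to verify directly. A minimal check, assuming a recent PyTorch version where the default reduction is the mean over all elements:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 3)  # arbitrary example shapes
t = torch.randn(8, 3)

loss_fn = nn.MSELoss()  # default reduction averages over every element
assert torch.allclose(loss_fn(x, t), ((x - t) ** 2).mean())
```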
Thanks. What’s the difference between this and (input-target).pow(2).sum()/input.size()[0]?
((input-target)**2).mean() is equivalent to (input-target).pow(2).sum() / input.numel(). Note that input.numel() doesn’t have to equal input.size()[0] unless input is a 1D tensor.
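To make the distinction concrete, here is a small sketch (the shape is just an arbitrary example):

```python
import torch

x = torch.randn(4, 3, 2)  # hypothetical 3D tensor

print(x.size()[0])  # 4  -- the batch size (first dimension only)
print(x.numel())    # 24 -- total number of elements: 4 * 3 * 2
```

For a 1D tensor of length N, both expressions return N, which is why the two loss formulas coincide only in that case.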
Hi,
I tried the following loss functions
output = model(data)
train_loss1 = F.mse_loss(output, target.cuda(), True)
train_loss2 = ((output - target.cuda())**2).mean()
and got different results. Also, the training curve is different.
How is that possible, please?
If training is different, the computation of gradients is probably different, assuming you’re using the same optimizer.
Can you check if your gradients (or the output) are different? Are there any random parts to your model (random initialization, etc.)? Weights of nn layers are usually randomly initialized.
Training is different, but that does not actually matter. What matters is that train_loss1 and train_loss2 are different numbers, though I would expect them to be equal. How can this happen?
@rasemailcz could you show me some inputs where train_loss1 and train_loss2 are different?
output = -0.0226 -0.1922 0.1461 -0.0481 -0.0526 -0.0449 0.1586 -0.0980 0.2177 0.0402 -0.1200 0.0010 0.0022 0.0430 0.1930 0.2304 0.0043 0.0659 0.2427 0.1580 -0.0449 -0.0477 0.1961 0.2336 -0.0308 0.0648 0.0669 0.0072 0.0353 -0.2793 0.0105 -0.1510 -0.0942 -0.1761 0.0477 -0.0564 -0.1628 0.0467 -0.0819 -0.2643 0.1066 -0.0952 0.0918 -0.0934 0.1405 -0.1959 -0.0477 -0.1138 -0.1032 -0.0622 -0.0658 0.2957 -0.1170 -0.1541 0.1663 0.2635 -0.1477 0.2634 0.0940 -0.0477 -0.1920 0.0104 0.3450 -0.0514 -0.1592 0.2188 -0.3998 -0.1696 -0.1194 -0.3216 -0.0702 0.0074 -0.0223 0.0597 0.0329 -0.1500 0.2207 0.1900 -0.0688 0.3510 0.1114 -0.0829 0.0919 -0.1787 0.0266 -0.2059 0.0821 -0.1061 0.1190 0.0090
target.cuda() = 0.3100 0.0638 -0.1865 -0.4252 -0.6371 -0.8090 -0.9300 -0.9922 -0.9919 -0.9289 -0.8072 -0.6343 -0.4212 -0.1814 0.0700 0.3170 0.5440 0.7365 0.8823 0.9720 0.9999 0.9641 0.8669 0.7143 0.5162 0.2850 0.0355 -0.2163 -0.4543 -0.6634 -0.8300 -0.9435 -0.9965 -0.9855 -0.9113 -0.7784 -0.5955 -0.3742 -0.1287 0.1252 0.3711 0.5931 0.7769 0.9105 0.9854 0.9965 0.9431 0.8286 0.6604 0.4493 0.2089 -0.0451 -0.2962 -0.5282 -0.7259 -0.8763 -0.9697 -1.0000 -0.9649 -0.8669 -0.7122 -0.5110 -0.2762 -0.0232 0.2313 0.4707 0.6794 0.8435 0.9523 0.9986 0.9792 0.8954 0.7526 0.5602 0.3308 0.0794 -0.1772 -0.4222 -0.6394 -0.8143 -0.9355 -0.9947 -0.9880 -0.9158 -0.7828 -0.5979 -0.3731 -0.1234 0.1345 0.3836
The following script:
import torch
import torch.nn.functional as F
from torch.autograd import Variable
output = Variable(torch.Tensor(
[-0.0226, -0.1922, 0.1461, -0.0481, -0.0526, -0.0449, 0.1586, -0.0980, 0.2177,
0.0402, -0.1200, 0.0010, 0.0022, 0.0430, 0.1930, 0.2304, 0.0043, 0.0659, 0.2427,
0.1580, -0.0449, -0.0477, 0.1961, 0.2336, -0.0308, 0.0648, 0.0669, 0.0072, 0.0353,
-0.2793, 0.0105, -0.1510, -0.0942, -0.1761, 0.0477, -0.0564, -0.1628, 0.0467,
-0.0819, -0.2643, 0.1066, -0.0952, 0.0918, -0.0934, 0.1405, -0.1959, -0.0477,
-0.1138, -0.1032, -0.0622, -0.0658, 0.2957, -0.1170, -0.1541, 0.1663, 0.2635,
-0.1477, 0.2634, 0.0940, -0.0477, -0.1920, 0.0104, 0.3450, -0.0514, -0.1592, 0.2188,
-0.3998, -0.1696, -0.1194, -0.3216, -0.0702, 0.0074, -0.0223, 0.0597, 0.0329,
-0.1500, 0.2207, 0.1900, -0.0688, 0.3510, 0.1114, -0.0829, 0.0919, -0.1787, 0.0266,
-0.2059, 0.0821, -0.1061, 0.1190, 0.0090]), requires_grad=True)
target = Variable(torch.Tensor(
[0.3100, 0.0638, -0.1865, -0.4252, -0.6371, -0.8090, -0.9300, -0.9922, -0.9919,
-0.9289, -0.8072, -0.6343, -0.4212, -0.1814, 0.0700, 0.3170, 0.5440, 0.7365, 0.8823,
0.9720, 0.9999, 0.9641, 0.8669, 0.7143, 0.5162, 0.2850, 0.0355, -0.2163, -0.4543,
-0.6634, -0.8300, -0.9435, -0.9965, -0.9855, -0.9113, -0.7784, -0.5955, -0.3742,
-0.1287, 0.1252, 0.3711, 0.5931, 0.7769, 0.9105, 0.9854, 0.9965, 0.9431, 0.8286,
0.6604, 0.4493, 0.2089, -0.0451, -0.2962, -0.5282, -0.7259, -0.8763, -0.9697,
-1.0000, -0.9649, -0.8669, -0.7122, -0.5110, -0.2762, -0.0232, 0.2313, 0.4707,
0.6794, 0.8435, 0.9523, 0.9986, 0.9792, 0.8954, 0.7526, 0.5602, 0.3308, 0.0794,
-0.1772, -0.4222, -0.6394, -0.8143, -0.9355, -0.9947, -0.9880, -0.9158, -0.7828,
-0.5979, -0.3731, -0.1234, 0.1345, 0.3836]))
train_loss1 = F.mse_loss(output, target, True)
train_loss2 = ((output - target)**2).mean()
gives
In [8]: train_loss1
Out[8]:
Variable containing:
0.5171
[torch.FloatTensor of size (1,)]
In [9]: train_loss2
Out[9]:
Variable containing:
0.5171
[torch.FloatTensor of size (1,)]
I’d also expect it to give the same on the gpu. Does your problem occur when your inputs are on the cpu?
I found the reason, but I must have missed it in the documentation. Can you explain to me (or point me to the documentation) why the output of
train_loss1 = F.mse_loss(output, target, True)
train_loss2 = ((output - target)**2).mean()
is different when the sizes of the tensors are:
output.data.size()
Out[9]: torch.Size([90, 1, 1, 1])
and
target.data.size()
Out[8]: torch.Size([90, 1])
?
When the output size is changed to [90, 1], the results are the same…
That appears to be a bug, thank you for pointing it out. I’ve opened an issue here: https://github.com/pytorch/pytorch/issues/4938.
Relevant reading is broadcasting semantics; something seems off about mse_loss.
@rasemailcz make sure your output and target have exactly the same shape. For example, if you have an output of size [32] but a target of size [32, 1], then the calculated loss is wrong. You can fix this with output.view(target.size()).
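The shape mismatch in the thread above can be reproduced directly. A sketch showing what broadcasting does to the [90, 1, 1, 1] vs. [90, 1] case (shapes taken from the thread; the random data is just for illustration):

```python
import torch
import torch.nn.functional as F

output = torch.randn(90, 1, 1, 1)
target = torch.randn(90, 1)

# Broadcasting aligns trailing dimensions, so the elementwise difference
# silently expands to shape [90, 1, 90, 1] -- every output element is
# compared against every target element, not paired one-to-one.
diff = output - target
print(diff.shape)  # torch.Size([90, 1, 90, 1])

# After reshaping output to target's shape, the built-in loss and the
# manual mean agree:
loss = F.mse_loss(output.view(target.size()), target)
manual = ((output.view(target.size()) - target) ** 2).mean()
assert torch.allclose(loss, manual)
```

Recent PyTorch versions emit a UserWarning when the input and target shapes of mse_loss differ, for exactly this reason.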
@richard it does not make sense to me, since input.size()[0] is just the batch size, so you divide your total loss by the batch size. What’s wrong with this?
loss = nn.MSELoss()
out = loss(x, t)
divides by the total number of elements in your tensor, which is different from the batch size.
Thanks. Usually people only divide by the batch size, I think?
My above explanation was for how nn.MSELoss is implemented. If you want to divide only by the batch size, you can do the following:
loss = nn.MSELoss(reduce=False)
out = loss(x, t).sum() / batch_size
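Note that reduce=False is the older API; in recent PyTorch versions the same idea is spelled with the reduction argument. A sketch of the divide-by-batch-size loss, assuming arbitrary example shapes:

```python
import torch
import torch.nn as nn

x = torch.randn(32, 10)
t = torch.randn(32, 10)
batch_size = x.size(0)

# reduction='none' returns the per-element squared errors; summing them
# and dividing by the batch size gives the sum-of-squares-per-sample loss.
out = nn.MSELoss(reduction='none')(x, t).sum() / batch_size

# Equivalent, using reduction='sum' directly:
out2 = nn.MSELoss(reduction='sum')(x, t) / batch_size
assert torch.allclose(out, out2)
```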
It’s divided by the total number of elements in the input, i.e. N*C*H*W for a 4D input, not the batch size.
@richard do you have any comment on this related post?