Strange MSELoss behaviour

I am setting up a ResNet model to solve a regression problem (as I said in a previous topic). My performance is a little lower than expected. Right now I am using (criterion = nn.L1Loss()), but I am interested in (criterion = nn.MSELoss()) in order to add that quadratic component. However, when I make this change, the value printed by “print(loss.item())” grows exponentially. In just a few iterations of the loop it ends up being NaN.

I have seen that this issue has also happened to other people, but I couldn’t solve it since every case was totally different. I am using 300x300 (or even 200x200) images and LR = 0.1. When I decrease it to 0.03, the loss values tend to be slightly lower, but it just takes one more iteration to reach NaN.

I don’t really know where the problem with this loss criterion could be. Below I attach the code I am running in case it helps.

Thank you in advance.

criterion = nn.MSELoss() #Delete existing conversion 

optimizer = optim.SGD(net.parameters(),, momentum=0.9, weight_decay=5e-4)
scheduler = optim.lr_scheduler.MultiStepLR(optimizer, milestones=[50, 150], gamma=0.1, last_epoch=-1)

def train(epoch, net):
    print('\nEpoch: {} ==> lr: {}'.format(epoch, scheduler.get_last_lr()))
    train_loss = 0
    for batch_idx, (inputs, targets) in enumerate(trainloader):
        inputs, targets =,
        targets =
        outputs = net(inputs)

        loss = criterion(outputs_mean, targets) 


        train_loss += loss.item()
        total += targets.size(0) #Related to display options
         #Loss of the epoch is calculated as: (train_loss/(batch_idx+1))

What is the problem you’re trying to solve? Are you trying to solve a segmentation/classification problem?
And what is outputs_mean? Is it a probability matrix with dimensions [batch size, label], or is it a mask matrix with dimensions [H, W]? I think it would be easier to answer if you could elaborate on the question a bit more.

Thanks for your reply.
OK, sorry, I made some changes to the code and forgot to fix that. The initial code was based on a classifier, which is why there were some “useless” variables that weren’t needed at all. I edited the first post to make it clearer.

The input of the network is a single-channel (black and white) image of [H,W]. The output of the network, right now, is a single continuous variable that ranges from -60 to 60 degrees (related to the orientation of a robot). Therefore, the value “outputs” inside the loop is an array whose shape depends on the batch size (8) and the number of outputs of the network (1).

Example of printing “outputs” during one iteration of the for loop with batch size = 8:
[-51.5339]], device=‘cuda:0’, grad_fn=< AddmmBackward >)

I would first suggest using a lower learning rate (e.g. 0.01 or 0.001). If MSE still diverges, then try normalizing the predictions and targets (divide by 60). If neither works, you could train the model on a smaller dataset (5%-10%) and see whether you can overfit it. If you can’t, then your model might be wrong.
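As a rough illustration of the normalization step suggested above, here is a minimal sketch. The helper names (normalize_angle, denormalize_angle) and the MAX_ANGLE_DEG constant are made up for this example; the only thing taken from the thread is the division by 60. Because the helpers use nothing but / and *, they should also work elementwise on a tensor of targets.

```python
# Hypothetical helpers, not part of the original training script.
MAX_ANGLE_DEG = 60.0  # targets in the thread range from -60 to 60 degrees


def normalize_angle(angle_deg):
    """Scale an angle in degrees to the range [-1, 1]."""
    return angle_deg / MAX_ANGLE_DEG


def denormalize_angle(value):
    """Map a normalized prediction back to degrees."""
    return value * MAX_ANGLE_DEG


# Train on normalize_angle(targets), then convert predictions back
# with denormalize_angle(outputs) when reporting errors in degrees.
normalized = normalize_angle(-51.5339)
restored = denormalize_angle(normalized)
```

With targets in [-1, 1], the squared error of an untrained network starts out much smaller than it would on raw degrees, which is the point of the suggestion.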

Thank you very much, now I know what to check when I have this type of problem. Despite the changes, it is still not working. I have seen that the problem lies in the output of the network. During the first epoch, the output tends to be really high. As a consequence, the loss tends to NaN when the difference between the target and the predicted output is squared (MSE), but not when there is no quadratic component in the loss formula (as in L1Loss).
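To make the difference between the two losses concrete, here is a toy sketch in plain Python (not the ResNet from this thread). It fits a single weight w so that w * x matches a target, using plain gradient descent with lr = 0.1. The values x = 10 and target = 5 are arbitrary assumptions chosen so that the effect is visible. The MSE gradient scales with the size of the error, so a large step overshoots and the error grows on every iteration; the L1 gradient has a fixed magnitude, so the weight only oscillates around the solution.

```python
def fit_weight(loss_kind, lr=0.1, steps=20):
    """Fit w in the toy model y = w * x with plain gradient descent."""
    x, target = 10.0, 5.0  # arbitrary toy values; the ideal weight is 0.5
    w = 0.0
    for _ in range(steps):
        residual = w * x - target
        if loss_kind == "mse":
            # d/dw (w*x - t)^2 = 2 * (w*x - t) * x: grows with the error
            grad = 2.0 * residual * x
        else:
            # d/dw |w*x - t| = sign(w*x - t) * x: constant magnitude
            grad = (1.0 if residual > 0 else -1.0) * x
        w -= lr * grad
    return w


mse_w = fit_weight("mse")  # blows up: the error is multiplied by -19 each step
l1_w = fit_weight("l1")    # stays bounded: w bounces between 0 and 1
```

In this toy case, lowering lr below 0.01 makes the MSE run converge as well, which matches the earlier advice to try a smaller learning rate first.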

Since the targets are now normalized to the range -1 to 1, I think I will try adding a sigmoid function at the end to avoid high output values and see if it works.

Edit: well, I tried it and it didn’t go as expected, since the loss stays completely constant across all the epochs.
Edit 2: I changed the optimizer from SGD to Adam and this one seems to be working with MSELoss, so problem solved.
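
For anyone landing here later, a minimal sketch of what the optimizer swap can look like. Everything except nn.MSELoss and optim.Adam is an assumption for illustration: the tiny linear model stands in for the ResNet, the 32x32 random inputs stand in for the real images, and lr=1e-3 is just a common starting value, not the value used in this thread.

```python
import math

import torch
import torch.nn as nn
import torch.optim as optim

torch.manual_seed(0)

# Hypothetical stand-in for the ResNet: one grayscale image in, one value out.
net = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 1))
criterion = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=1e-3)  # assumed learning rate

# Random placeholder batch of 8 images with targets already scaled to [-1, 1].
inputs = torch.randn(8, 1, 32, 32)
targets = torch.empty(8, 1).uniform_(-1.0, 1.0)

losses = []
for _ in range(5):
    optimizer.zero_grad()                   # clear gradients from the last step
    loss = criterion(net(inputs), targets)  # outputs and targets are both [8, 1]
    loss.backward()
    optimizer.step()
    losses.append(loss.item())
```

Note that outputs and targets have the same shape ([batch_size, 1]) here; a shape mismatch would make MSELoss broadcast, which is worth ruling out when the loss behaves strangely.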
