Incorrect MSE loss for float16

import torch
from torch.nn import MSELoss

loss_function = MSELoss()
loss_function(torch.tensor([0.0329]).to(torch.float16), torch.tensor([60000]).to(torch.float16))
--> tensor(inf, dtype=torch.float16)

Why is the result inf?

float16 has a max range of ±65504 and will overflow to ±inf outside of this range.
It's thus expected that nn.MSELoss will overflow via (0.0329 - 60000)**2 ≈ 3.6e9
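The overflow can be reproduced step by step (a minimal sketch: the intermediate difference still fits in float16, but its square does not):

```python
import torch

pred = torch.tensor([0.0329], dtype=torch.float16)
target = torch.tensor([60000.], dtype=torch.float16)

diff = pred - target  # ~ -60000, still inside float16's ±65504 range
sq = diff * diff      # ~3.6e9 exceeds 65504 and overflows to inf
print(sq)             # the same inf the MSELoss call above produces

# performing the computation in float32 avoids the overflow
print((pred.float() - target.float()).pow(2))
```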

Thank you for the input. So it's not possible to train an fp16 model with MSE, since the loss is going to be inf in most cases where it is higher than ~65k?

Training any model in pure float16 is tricky, as not only can large activation values overflow, but your training would also suffer from small gradients underflowing.
This is why we've developed the mixed-precision training utilities in torch.amp, which not only use an autocast context to transform tensors to float16 when it's safe, but also use a loss scaler to avoid underflows. Take a look at the AMP recipe and the examples to see how to use them.
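A minimal sketch of such a mixed-precision loop (names are placeholders; autocast and the GradScaler are disabled here when no GPU is present, so the snippet also runs in float32 on CPU):

```python
import torch
from torch import nn

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"  # float16 autocast targets CUDA; fall back to fp32 on CPU

net = nn.Linear(10, 1).to(device)
opt = torch.optim.SGD(net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)

for _ in range(3):
    data = torch.randn(8, 10, device=device)
    target = torch.randn(8, 1, device=device)
    opt.zero_grad()
    with torch.autocast(device_type=device, enabled=use_amp):
        output = net(data)
        loss = loss_fn(output, target)
    scaler.scale(loss).backward()  # scale up so small fp16 gradients don't underflow
    scaler.step(opt)               # unscales the gradients, then calls opt.step()
    scaler.update()

print(torch.isfinite(loss).item())
```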

I tried that, but I get the following error:

RuntimeError: Found dtype Float but expected Half

Could you post a minimal, executable code snippet which would reproduce the error, please?
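Without the snippet this is only a guess, but a common cause of that error is a dtype mismatch between a half-precision model output and a float32 target in the loss computation. A minimal sketch of one way to keep the dtypes consistent:

```python
import torch
from torch import nn

loss_fn = nn.MSELoss()
pred = torch.randn(4, dtype=torch.float16)  # stand-in for a half-precision model output
target = torch.randn(4)                     # float32 targets, e.g. from a DataLoader

# Upcasting the prediction keeps both loss inputs in float32
# (alternatively, cast the target via target.to(pred.dtype)).
loss = loss_fn(pred.float(), target)
print(loss.dtype)
```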

Can you tell me what part of the code would be helpful? Since it is very modular, I'm not sure I can paste the whole code.

The first link already does that: it describes a simple network with a standard training loop in default precision. In the next section autocast is added, and afterwards the GradScaler, both with code changes and explanations of why these utilities are used. Then the same initial code is posted again as "All together: Automatic Mixed Precision". Did you walk through this doc and get stuck somewhere?

Yes, I went through the doc. I think the code works for me now, but I get an error while updating the learning rate scheduler. Is this the correct way to update the learning rate scheduler:

scaler.step(self.lr_scheduler)

or should I call it in the conventional way, self.lr_scheduler.step()?
The latter works for me, while the former gives the following error:

'LambdaLR' object has no attribute 'param_groups'

scaler.step expects an optimizer, so use lr_scheduler.step() instead.
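A minimal sketch of the intended ordering (hypothetical names; the scaler is disabled here so the snippet also runs on CPU):

```python
import torch
from torch import nn

net = nn.Linear(4, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.1)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(opt, lambda epoch: 0.5 ** epoch)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # disabled so this runs without a GPU

for _ in range(2):
    opt.zero_grad()
    loss = net(torch.randn(8, 4)).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(opt)      # GradScaler steps the *optimizer*, never the scheduler
    scaler.update()
    lr_scheduler.step()   # the scheduler keeps its conventional .step() call

print(opt.param_groups[0]["lr"])
```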


With AMP, my model is predicting inf. I only get inf during AMP training, not during full-precision training. Because of the inf values, sklearn raises an error during accuracy computation in training.
Any thoughts on why this is happening?

The forward pass should not overflow under autocast, so I don't know what might be causing it and would need more information about the model etc.
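One way to narrow it down is to register forward hooks that flag any module producing non-finite outputs (a debugging sketch with a toy model; `check_finite` is a hypothetical helper, not a library API):

```python
import torch
from torch import nn

def check_finite(name):
    # Hook that reports modules whose output contains inf or nan.
    def hook(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite output in module: {name!r}")
    return hook

net = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), nn.Linear(4, 1))
for name, module in net.named_modules():
    module.register_forward_hook(check_finite(name))

out = net(torch.randn(2, 4))          # run this inside your autocast region
print(torch.isfinite(out).all().item())
```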

If you look at this part of the code from the doc:

        with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=use_amp):
            output = net(input)
aren't we telling the model that the output should always be float16, which will cause the overflow?
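For reference, autocast's per-op behavior can be inspected directly: the context picks a dtype per operation rather than forcing everything to float16, and you can still upcast manually inside the region. A CPU sketch using bfloat16 (float16 autocast requires CUDA; the idea is the same):

```python
import torch
from torch import nn

net = nn.Linear(4, 4)
x = torch.randn(2, 4)

with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    out = net(x)           # matmul-type ops run in the low-precision dtype
    s = out.float().sum()  # a manual upcast inside the region stays float32

print(out.dtype, s.dtype)
```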