loss_function = MSELoss()
loss_function(torch.tensor([0.0329]).to(torch.float16), torch.tensor([60000]).to(torch.float16))
--> tensor(inf, dtype=torch.float16)
why is the result inf?
float16 has a max range of +- 65504 and will overflow to +- Inf outside of this range. It’s thus expected that nn.MSELoss will overflow via (0.03 - 60000)**2 ~= 3.6e9.
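This can be reproduced without a loss function at all; a minimal sketch using the values from the snippet above:

```python
import torch

# float16 can only represent magnitudes up to ~65504
print(torch.finfo(torch.float16).max)  # 65504.0

# the squared error (0.0329 - 60000)**2 ~= 3.6e9 is far beyond that range,
# so casting the result to float16 overflows to inf
squared_error = (0.0329 - 60000.0) ** 2
overflowed = torch.tensor(squared_error, dtype=torch.float16)
print(overflowed)  # tensor(inf, dtype=torch.float16)
```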
Thank you for the input. So it’s not possible to train an fp16 model with MSE, since the loss will be inf in most cases where it is higher than 65k?
Training any model in pure float16 is tricky: not only can large activation values overflow, but your training would also suffer from small gradients underflowing. This is why we’ve developed the mixed-precision training utilities in torch.amp, which not only use an autocast context to transform tensors to float16 when it’s safe, but also use a loss scaler to avoid underflows. Take a look at the AMP recipe and the examples to see how to use it.
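A minimal sketch of such a mixed-precision training step; the model, data, and hyperparameters here are placeholders, and the explicit `.float()` on the outputs just keeps the sketch portable to CPU (on CUDA, autocast already runs losses in float32):

```python
import torch

# placeholder model and random data; only the amp plumbing matters here
device = 'cuda' if torch.cuda.is_available() else 'cpu'
amp_dtype = torch.float16 if device == 'cuda' else torch.bfloat16  # cpu autocast uses bfloat16
model = torch.nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = torch.nn.MSELoss()
scaler = torch.cuda.amp.GradScaler(enabled=(device == 'cuda'))

inputs = torch.randn(4, 10, device=device)
targets = torch.randn(4, 1, device=device)

optimizer.zero_grad()
with torch.autocast(device_type=device, dtype=amp_dtype):
    outputs = model(inputs)          # autocast-eligible ops run in low precision
    loss = loss_fn(outputs.float(), targets)

scaler.scale(loss).backward()  # scale the loss so small fp16 gradients don't underflow
scaler.step(optimizer)         # unscales gradients; skips the step if they contain inf/nan
scaler.update()                # adjusts the scale factor for the next iteration
```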
I tried that but I get the following error:
RuntimeError: Found dtype Float but expected Half
Could you post a minimal, executable code snippet which would reproduce the error, please?
Can you tell me what part of the code would be helpful? It is very modular, so I’m not sure I can paste the whole code.
The first link already does that, as it describes a simple network first with a standard training loop in default precision. In the next section autocast is added, and afterwards the GradScaler, both with code changes and with explanations of why these utilities are used. Then the same initial code is posted again under “All together: Automatic Mixed Precision”. Did you walk through this doc and get stuck somewhere?
Yes, I went through the doc. I think the code works for me now, but I get an error while updating the learning rate scheduler. Is this the correct way to update the learning rate scheduler:
scaler.step(self.lr_scheduler)
or should I call it in the conventional way, self.lr_scheduler.step()? The latter works for me, whereas the former gives the following error:
'LambdaLR' object has no attribute 'param_groups'
scaler.step expects an optimizer, so use lr_scheduler.step() instead.
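A minimal sketch of the intended call order, with a placeholder model and schedule (the GradScaler is disabled here only so the sketch also runs on CPU):

```python
import torch

# placeholder model and schedule; only the call order matters
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=1.0)
lr_scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lambda epoch: 0.5 ** epoch)
scaler = torch.cuda.amp.GradScaler(enabled=False)  # disabled so this runs without a GPU

loss = model(torch.randn(2, 4)).sum()
scaler.scale(loss).backward()
scaler.step(optimizer)  # the scaler steps the optimizer, never the scheduler
scaler.update()
lr_scheduler.step()     # the scheduler is stepped directly, in the usual way
print(optimizer.param_groups[0]['lr'])  # 0.5 after one scheduler step
```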
With amp, my model is predicting inf. I get inf only during amp training and not during full-precision training. Because of the inf values I get an error in sklearn during accuracy computation while training.
Any thoughts on why this is happening?
The forward method should not overflow, so I don’t know what might be causing it and would need more information about the model etc.
If you look at this part of the code from the doc:
with torch.autocast(device_type='cuda', dtype=torch.float16, enabled=use_amp):
output = net(input)
aren't we telling the model the output should always be float16, which will cause the overflow?
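One way to check this empirically is to inspect the dtypes directly. A small sketch with a hypothetical toy layer, run on CPU with bfloat16 since a GPU may not be available (on CUDA the low-precision dtype would be torch.float16):

```python
import torch

# toy layer and input, just to inspect dtypes under autocast
net = torch.nn.Linear(8, 2)
x = torch.randn(3, 8)

with torch.autocast(device_type='cpu', dtype=torch.bfloat16):
    out = net(x)  # matmul-based ops run in the lower precision
    loss = torch.nn.functional.mse_loss(out.float(), torch.zeros(3, 2))

print(out.dtype)   # only the autocast-eligible ops are downcast
print(loss.dtype)  # the loss itself is computed in full precision
```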