Impact on performance when a model is trained in float32 and inference is done in float16 (half precision)?

I am trying to train the model in float32 and run inference in float16. I think my code is working fine. I wanted to know if anyone else has done this to reduce the size of the model, and whether there is any change in accuracy when decreasing the precision. I know this helps increase the speed a little, but I am worried about the accuracy.

To be safe, you could use native mixed precision via torch.cuda.amp, which will make sure that only “safe” operations are performed in FP16.
Calling half() on your model directly might work in a lot of use cases, but it might also yield overflows etc., so you would have to verify your model manually.
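
A minimal sketch of both options (the torchvision model and the input tensor are just placeholders for your own FP32-trained model and data):

```python
import copy
import torch
import torchvision.models as models

model = models.resnet18().cuda().eval()          # stand-in for your FP32-trained model
x = torch.randn(1, 3, 224, 224, device="cuda")   # placeholder input batch

# Option 1: mixed precision -- autocast keeps numerically sensitive ops in FP32
with torch.no_grad(), torch.cuda.amp.autocast():
    out_amp = model(x)

# Option 2: pure FP16 -- both the model and the inputs must be converted
model_fp16 = copy.deepcopy(model).half()
with torch.no_grad():
    out_fp16 = model_fp16(x.half())
```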

If you want to use FP16 manually, you could e.g. compare the validation accuracy against the FP32 baseline before deploying the model, as a quick sanity check.
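
A hedged sketch of such a check, assuming a `model` trained in FP32 and a `val_loader` yielding (input, target) batches for a classification task (both names are placeholders):

```python
import copy
import torch

@torch.no_grad()
def accuracy(model, loader, device="cuda", fp16=False):
    """Top-1 accuracy; optionally casts inputs to FP16 to match a half() model."""
    correct = total = 0
    for inputs, targets in loader:
        inputs, targets = inputs.to(device), targets.to(device)
        if fp16:
            inputs = inputs.half()
        preds = model(inputs).argmax(dim=1)
        correct += (preds == targets).sum().item()
        total += targets.numel()
    return correct / total

# Compare the FP32 baseline against an FP16 copy before deployment
acc_fp32 = accuracy(model.eval(), val_loader)
acc_fp16 = accuracy(copy.deepcopy(model).eval().half(), val_loader, fp16=True)
print(f"FP32: {acc_fp32:.4f}  FP16: {acc_fp16:.4f}  drop: {acc_fp32 - acc_fp16:.4f}")
```

If the drop is negligible on your validation set, deploying the half() model is usually fine; if not, autocast is the safer fallback.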
