Model weight size and inference time for FP16 model using apex mixed precision with optimization level O3

Hi,

I have successfully trained a model using the NVIDIA apex mixed precision plugin and saved the weights as a .pth file.
I used optimization level O3, so I expect my model weights to be saved as FP16 weights.
However, I see that the size of the .pth file is the same as that of an FP32-trained model.

Could you please help me understand this?
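
For reference, this is roughly how I checked: a minimal sketch that loads the checkpoint and prints the dtype of each saved tensor (the file path and the "state_dict" key are placeholders; your checkpoint layout may differ):

```python
import torch

# Load the checkpoint on the CPU; "model.pth" is a placeholder path.
checkpoint = torch.load("model.pth", map_location="cpu")

# If the checkpoint wraps the weights (e.g. under a "state_dict" key), unwrap it first.
state_dict = checkpoint.get("state_dict", checkpoint) if isinstance(checkpoint, dict) else checkpoint

# torch.float16 vs. torch.float32 here tells you whether the weights were
# actually stored in half precision.
for name, tensor in state_dict.items():
    print(name, tensor.dtype, tensor.numel())
```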

Also, I haven’t timed the inference yet, but can I expect it to be faster than FP32?

My idea behind this is: I want to use TensorRT in FP16. The default path is to train in FP32 and then convert/quantize to FP16 using Torch-TensorRT (see the sketch below), but I see some performance degradation with that route. So maybe if I train natively in FP16 and then do the conversion, it will give me better results.
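
For context, the conversion step I am using looks roughly like this (the model and input shape are placeholders, and the exact Torch-TensorRT arguments may differ between versions):

```python
import torch
import torch_tensorrt
import torchvision.models as models

# Placeholder FP32 model; substitute your own trained network.
model = models.resnet18(pretrained=True).eval().cuda()

# Compile with TensorRT, allowing FP16 kernels via enabled_precisions.
trt_model = torch_tensorrt.compile(
    model,
    inputs=[torch_tensorrt.Input((1, 3, 224, 224))],
    enabled_precisions={torch.half},
)

x = torch.randn(1, 3, 224, 224, device="cuda")
out = trt_model(x)
```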

Is this assumption sensible?

Please let me know.

Best Regards

apex.amp is deprecated, and the O3 opt_level was used for “pure” FP16 training, which you can achieve by directly calling .half() on the model and the input data without using apex.amp.
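
Something like this minimal sketch (the model, shapes, and output path are placeholders):

```python
import torch
import torch.nn as nn

# Placeholder model; replace with your own network.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10)).cuda()

# Cast all parameters and buffers to FP16 ("pure" FP16, comparable to apex opt_level O3).
model.half()

# Inputs must be cast to FP16 as well so dtypes match inside the model.
x = torch.randn(32, 128, device="cuda", dtype=torch.half)
out = model(x)

# Saving the state_dict now stores FP16 tensors, so the .pth file should be
# roughly half the size of an FP32 checkpoint.
torch.save(model.state_dict(), "model_fp16.pth")
```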