The model size with respect to its parameters won't change, since only the operations (and intermediate activations) are cast to FP16 for "safe" ops.
So while you might be able to increase the batch size during training or inference, the state_dict won't get smaller.
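For example, a quick check (a minimal sketch, assuming a CUDA device and a toy `nn.Linear` model) shows that autocast changes the dtype of the outputs but leaves the stored parameters in FP32:

```python
import torch

# Minimal sketch: autocast affects the dtype of intermediate computations,
# not the stored parameters. Assumes a CUDA device is available.
model = torch.nn.Linear(10, 10).cuda()
x = torch.randn(8, 10, device="cuda")

with torch.cuda.amp.autocast():
    out = model(x)

print(out.dtype)           # torch.float16 (autocast-eligible op)
print(model.weight.dtype)  # torch.float32 (parameters unchanged)
print(next(iter(model.state_dict().values())).dtype)  # torch.float32
```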
Which batch size are you using for inference? If you are seeing a speedup during training, you should also see one during inference. However, if your batch size is small (e.g. a single sample), the performance gain might be too small compared to the overhead of launching all the kernels.
Automatic mixed precision is implemented for CUDA operations (and thus lives in the torch.cuda namespace). By applying amp, your GPU could use TensorCores for certain operations, which would yield a speedup. I don't know if anything like that is implemented for CPU operations (and, if I'm not mistaken, not all operations are implemented for HalfTensors on the CPU).
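In case it's useful, this is roughly how amp is typically used on the GPU, following the usual autocast + GradScaler recipe (the model, optimizer, and loss below are just placeholders):

```python
import torch

# Sketch of a typical torch.cuda.amp training step; model, data, and
# loss_fn stand in for your own setup.
model = torch.nn.Linear(10, 2).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
loss_fn = torch.nn.CrossEntropyLoss()
scaler = torch.cuda.amp.GradScaler()

data = torch.randn(32, 10, device="cuda")
target = torch.randint(0, 2, (32,), device="cuda")

optimizer.zero_grad()
with torch.cuda.amp.autocast():
    output = model(data)            # eligible ops run in FP16 (TensorCores)
    loss = loss_fn(output, target)
scaler.scale(loss).backward()       # scale loss to avoid FP16 gradient underflow
scaler.step(optimizer)
scaler.update()
```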
Thanks for your response. No, I am not using dynamic quantization. But since I cannot do mixed precision on the CPU, I guess I have to switch to dynamic quantization for CPU inference. Am I right?
I've unfortunately never profiled the quantized models, so I'm unsure what the expected speedup is.
However, please let us know if you end up using post-training quantized models and how large the performance gain is.
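As a starting point, a minimal sketch of post-training dynamic quantization on the CPU could look like this (the Sequential model below is just a stand-in for your own):

```python
import torch

# Hedged sketch of post-training dynamic quantization for CPU inference;
# the Linear layers stand in for whichever modules your model uses.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 64),
    torch.nn.ReLU(),
    torch.nn.Linear(64, 10),
)
model.eval()

# Replace Linear layers with dynamically quantized (int8) versions.
quantized = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
with torch.no_grad():
    out = quantized(x)
print(out.shape)
```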