Can quantizing models enable you to have bigger batch sizes during inference?

miken · June 25, 2020, 5:31pm

I have a model in production, and I want to increase its throughput for speed reasons. I’ve tried quantizing the model, but for some reason, if I increase the batch size I still run into an OOM error. I thought that quantizing the model from fp32 to, say, fp16 would allow bigger batch sizes. If that’s not the case, what is the use case for quantizing models?

supriyar · June 26, 2020, 1:03am

Quantizing the model lets you run it at lower precision (int8), so it runs faster. Also, since the tensors are quantized to 8 bits, they occupy less storage space as well.
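
To put rough numbers on the storage point, here is a back-of-the-envelope sketch in plain Python. It is only an estimate under stated assumptions: the model size (`NUM_PARAMS`) and per-sample activation count (`ACTS_PER_SAMPLE`) are hypothetical, the helper names are made up for illustration, and it ignores quantization parameters (scales, zero points) and allocator overhead. It also assumes the weights are quantized while activations stay in fp32, which may not match a given setup.

```python
# Bytes used per element at each precision.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def tensor_bytes(num_elements: int, dtype: str) -> int:
    """Raw storage for a tensor, ignoring scale/zero-point and allocator overhead."""
    return num_elements * BYTES_PER_ELEMENT[dtype]


def estimate_memory(num_params: int, acts_per_sample: int, batch_size: int,
                    weight_dtype: str, act_dtype: str) -> int:
    """Weights are a fixed cost; activation storage grows linearly with batch size."""
    weights = tensor_bytes(num_params, weight_dtype)
    activations = tensor_bytes(acts_per_sample * batch_size, act_dtype)
    return weights + activations


# Hypothetical model, chosen only for illustration.
NUM_PARAMS = 100_000_000      # 100M weights
ACTS_PER_SAMPLE = 5_000_000   # activation values kept alive per input sample

for batch_size in (1, 32, 128):
    full = estimate_memory(NUM_PARAMS, ACTS_PER_SAMPLE, batch_size, "fp32", "fp32")
    quant = estimate_memory(NUM_PARAMS, ACTS_PER_SAMPLE, batch_size, "int8", "fp32")
    print(f"batch={batch_size:4d}  fp32: {full / 2**20:8.0f} MiB  "
          f"int8 weights: {quant / 2**20:8.0f} MiB")
```

Under these assumptions the saving from quantized weights is a fixed amount, while the activation term keeps growing with the batch size, which is one possible reason a larger batch can still hit an OOM error after quantization.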