I have a model in production and want to increase its throughput. I've tried quantizing the model, but if I increase the batch size I still run into an OOM error. I thought that quantizing the model from fp32 to, say, fp16 would allow bigger batch sizes. If that's not the case, what is the use case for quantizing models?
Quantizing a model lets it run at lower precision (int8), so it runs faster. Also, since the tensors are quantized to 8 bits, they occupy less storage space as well.
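One possible reason the OOM persists at larger batch sizes: quantization shrinks each stored element, but the memory used by activations still grows linearly with batch size, while the weight memory stays fixed. Here is a rough back-of-the-envelope sketch in plain Python. The function names, the parameter count, and the activations-per-sample figure are all made up for illustration, and the estimate ignores framework overhead, workspace buffers, and fragmentation.

```python
# Rough memory estimate: weights are a fixed cost, activations scale with batch size.
# All names and numbers here are hypothetical, not taken from any real model.

BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "int8": 1}


def weight_bytes(num_params, dtype):
    """Memory for the model weights; independent of batch size."""
    return num_params * BYTES_PER_ELEMENT[dtype]


def activation_bytes(batch_size, activation_elements_per_sample, dtype):
    """Memory for intermediate activations; grows linearly with batch size."""
    return batch_size * activation_elements_per_sample * BYTES_PER_ELEMENT[dtype]


def estimate_total_bytes(num_params, batch_size, activation_elements_per_sample,
                         weight_dtype, activation_dtype):
    """Sum of the two terms above; a real runtime will use more than this."""
    return (weight_bytes(num_params, weight_dtype)
            + activation_bytes(batch_size, activation_elements_per_sample,
                               activation_dtype))


if __name__ == "__main__":
    num_params = 100_000_000          # assumed model size
    act_per_sample = 50_000_000       # assumed activation elements per input
    for batch_size in (8, 32, 128):
        for dtype in ("fp32", "fp16", "int8"):
            total = estimate_total_bytes(num_params, batch_size, act_per_sample,
                                         dtype, dtype)
            print(f"batch={batch_size:4d} dtype={dtype}: {total / 2**30:.2f} GiB")
```

Under these assumptions, halving the element size roughly doubles the batch size that fits, but it does not remove the limit. Whether activations are actually stored at the lower precision depends on the quantization scheme; with weight-only quantization, only the first term shrinks.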