I'm working through "(beta) Static Quantization with Eager Mode in PyTorch" from the PyTorch Tutorials 2.6.0+cu124 documentation.
In the final speed-testing section, the tutorial mentions:
"Running this locally on a MacBook Pro yielded 61 ms for the regular model, and just 20 ms for the quantized model, illustrating the typical 2-4x speedup we see for quantized models compared to floating-point ones."
However, when I ran it, I got 15 ms for the regular model and 30 ms for the quantized model, so the quantized model is actually slower.
Why might that be? Any help would be appreciated.
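For context, I measured both models roughly like this (a simplified sketch; `float_model`, `quantized_model`, and `example_input` stand in for the objects built earlier in the tutorial):

```python
import time
import torch

def benchmark(model, inp, n_iter=50):
    """Return the average per-inference latency in milliseconds on CPU."""
    model.eval()
    with torch.no_grad():
        for _ in range(5):          # warm-up runs, not timed
            model(inp)
        start = time.time()
        for _ in range(n_iter):
            model(inp)
    return (time.time() - start) / n_iter * 1000

# float_model, quantized_model and example_input come from the tutorial script
# print("float:     %.1f ms" % benchmark(float_model, example_input))
# print("quantized: %.1f ms" % benchmark(quantized_model, example_input))
```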
@BambooKui Theoretically this is possible in the following scenario:
- Your machine's memory was almost exhausted, so when you ran the prediction on the quantized model a memory bottleneck kicked in
To test this hypothesis (a minimal sketch follows these steps):
- Save both the models in separate files
- Terminate the Python session
- Open another session, load just the quantized model and check the benchmark timings
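Something along these lines should work (a rough sketch, not the tutorial's exact code; it assumes both models can be scripted with TorchScript, as the tutorial does for the quantized model, and the input shape below is a placeholder):

```python
# --- Session 1: save both models to separate files ---------------------------
import torch

# Scripting the models first means the original class definitions are not
# needed when loading them back in a fresh process
torch.jit.save(torch.jit.script(float_model), "float_model.pt")
torch.jit.save(torch.jit.script(quantized_model), "quantized_model.pt")
```

```python
# --- Session 2: fresh Python process, load only the quantized model ----------
import time
import torch

model = torch.jit.load("quantized_model.pt")
model.eval()
inp = torch.randn(1, 3, 224, 224)  # placeholder: use the same input shape as before

with torch.no_grad():
    for _ in range(5):              # warm-up runs, not timed
        model(inp)
    start = time.time()
    for _ in range(50):
        model(inp)
print("quantized avg latency: %.1f ms" % ((time.time() - start) / 50 * 1000))
```

Comparing that number against the float model timed the same way in its own fresh process should tell you whether memory pressure was skewing the original measurement.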
Thank you for your reply. I ran the inference part in a separate process, and the quantized model was then about twice as fast as the non-quantized one, although the improvement still wasn't as significant as described in the tutorial.