I'm working through "(beta) Static Quantization with Eager Mode in PyTorch" from the PyTorch Tutorials 2.6.0+cu124 documentation.
In the final speed-testing section, the tutorial mentions:
"Running this locally on a MacBook Pro yielded 61 ms for the regular model, and just 20 ms for the quantized model, illustrating the typical 2-4x speedup we see for quantized models compared to floating-point ones."
However, when I ran it, I got 15 ms for the regular model and 30 ms for the quantized model, so the quantized model is actually slower.
Why might that be? Any help would be appreciated.
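For context, I measured both models roughly like this (a simplified sketch; `float_model`, `quantized_model`, and `example_input` stand in for the objects built earlier in the tutorial):

```python
import time
import torch

def benchmark(model, inp, n_iter=50):
    """Return the average per-inference latency in milliseconds on CPU."""
    model.eval()
    with torch.no_grad():
        for _ in range(5):          # warm-up runs, not timed
            model(inp)
        start = time.time()
        for _ in range(n_iter):
            model(inp)
    return (time.time() - start) / n_iter * 1000

# float_model, quantized_model and example_input come from the tutorial script
# print("float:     %.1f ms" % benchmark(float_model, example_input))
# print("quantized: %.1f ms" % benchmark(quantized_model, example_input))
```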
@BambooKui Theoretically this is possible in the following scenario:
- Your machine's memory was almost exhausted, so when you ran the prediction on the quantized model a memory bottleneck kicked in
To test this hypothesis (a minimal sketch follows these steps):
- Save both the models in separate files
- Terminate the Python session
- Open another session, load just the quantized model and check the benchmark timings
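Something along these lines should work (a rough sketch, not the tutorial's exact code; it assumes both models can be scripted with TorchScript, as the tutorial does for the quantized model, and the input shape below is a placeholder):

```python
# --- Session 1: save both models to separate files ---------------------------
import torch

# Scripting the models first means the original class definitions are not
# needed when loading them back in a fresh process
torch.jit.save(torch.jit.script(float_model), "float_model.pt")
torch.jit.save(torch.jit.script(quantized_model), "quantized_model.pt")
```

```python
# --- Session 2: fresh Python process, load only the quantized model ----------
import time
import torch

model = torch.jit.load("quantized_model.pt")
model.eval()
inp = torch.randn(1, 3, 224, 224)  # placeholder: use the same input shape as before

with torch.no_grad():
    for _ in range(5):              # warm-up runs, not timed
        model(inp)
    start = time.time()
    for _ in range(50):
        model(inp)
print("quantized avg latency: %.1f ms" % ((time.time() - start) / 50 * 1000))
```

Comparing that number against the float model timed the same way in its own fresh process should tell you whether memory pressure was skewing the original measurement.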
Thank you for your reply. I ran the inference part in a separate process, and the quantized model was then about twice as fast as the non-quantized one, although the improvement still wasn't as significant as described in the tutorial.