I’ve been running tests with the compile feature of PyTorch 2.0 and I’m observing impressive inference-speed gains on a Transformer model. However, I’ve tried quantizing compiled models without success: there is no observable gain between the normal and the quantized model.
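For context, the setup is along these lines (a minimal sketch, not the actual benchmark: the toy model and shapes below are placeholders, and I use `backend="eager"` only so the snippet runs without a C++ toolchain, whereas the real gains come from the default inductor backend):

```python
import torch
import torch.nn as nn

# Toy stand-in for the Transformer used in the real tests.
model = nn.Sequential(
    nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 128)
).eval()

# backend="eager" keeps this sketch runnable anywhere; drop it to use
# the default inductor backend, which is what produces the speedups.
compiled = torch.compile(model, backend="eager")

x = torch.randn(4, 128)
with torch.inference_mode():
    out = compiled(x)
print(out.shape)  # torch.Size([4, 128])
```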
Is that expected? If not, is there future work aiming to tackle this issue?
Thank you very much!
Are you trying to compare gains between FP32 model and INT8 model, both with torch.compile enabled?
What are the results of quantizing the model without torch.compile?
Which backend are you running the model on (server CPU or something else)?
We haven’t added support for accelerating quantized models with torch.compile yet. Typically, the compute-intensive quantized ops have device-specific implementations (fbgemm for server CPU, qnnpack/xnnpack for edge), which might not benefit from it. For the others, we might get some speedup with torch.compile, but this hasn’t been explored yet.
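For example, you can inspect and select the engine that dispatches those device-specific quantized kernels (illustrative only; which engines appear depends on how your PyTorch build was compiled):

```python
import torch

# Quantized ops dispatch to a device-specific engine:
# typically 'fbgemm' on server x86 CPUs, 'qnnpack' on ARM/edge.
print(torch.backends.quantized.supported_engines)

# Prefer fbgemm when available (server x86); otherwise keep the default.
if 'fbgemm' in torch.backends.quantized.supported_engines:
    torch.backends.quantized.engine = 'fbgemm'
print(torch.backends.quantized.engine)
```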
Thanks for getting back.
Yes, that’s it: I’m comparing FP32 and INT8 models, both with torch.compile enabled. I was hoping to observe a gain similar to what we get between FP32 and INT8 models without torch.compile.
I’m running tests on an XLM-RoBERTa Transformer on server CPU and observe a 45% gain in inference speed between the unquantized and quantized models.
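The comparison is along these lines (a hedged sketch: the toy linear stack below stands in for XLM-RoBERTa, and the iteration counts and shapes are placeholders, not the actual benchmark):

```python
import time
import torch
import torch.nn as nn

# Toy stand-in for a Transformer encoder stack; dynamic quantization
# targets the nn.Linear layers, which dominate Transformer inference.
model = nn.Sequential(*[nn.Linear(256, 256) for _ in range(8)]).eval()

# INT8 dynamic quantization of all Linear layers.
qmodel = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(32, 256)

def bench(m, iters=50):
    """Average latency per forward pass, in seconds."""
    with torch.inference_mode():
        m(x)  # warm-up
        t0 = time.perf_counter()
        for _ in range(iters):
            m(x)
    return (time.perf_counter() - t0) / iters

print(f"fp32: {bench(model)*1e3:.2f} ms  int8: {bench(qmodel)*1e3:.2f} ms")
```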
When using torch.compile, inference speed stays the same, but I guess this is expected given your reply. Still, it would be great for torch.compile to support quantized models and obtain even better results!
We (Intel) are working on the design (in collaboration with Meta folks) and a PoC ([Experiment][Inductor][Quant] inductor as a quantization backend for PT 2.0 quantization by Xia-Weiwen · Pull Request #91226 · pytorch/pytorch · GitHub) to enable x86 CPU quantization optimization in PT 2.0 TorchInductor. It is still at an early stage, though. Any feedback is welcome.
Hey Jiong, thanks a lot for letting me know, I’m looking forward to seeing the results.