The PyTorch quantization docs suggest that, for efficient inference, we must use a CPU with AVX2 support or higher.
If we were to consider transformer-class models trained/quantized and served on x86 architectures, using FBGEMM as the quantization engine:
- Does INT8 quantization via the native PyTorch APIs take advantage of the AVX512 instruction set, and if so, in which version of PyTorch was this support introduced? (The sketch after this list shows how we have been inspecting which ISA our build dispatches to.)
- If AVX512 is supported, what impact on latency and throughput should we expect from moving from AVX2 to AVX512 with native PyTorch quantization?
- Are there any existing benchmarks that quantify the improvement?
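For context, here is roughly how we have been checking which vector ISA our local build dispatches to. This is only a sketch under our own assumptions: as far as we understand, `ATEN_CPU_CAPABILITY` and the "CPU capability usage" line describe ATen's vectorized dispatch, while FBGEMM does its own runtime CPU detection, so this may not tell the whole story for the quantized kernels (which is part of the question).

```python
import torch

# Build/runtime summary; recent builds include a line like
# "CPU capability usage: AVX2" (or AVX512).
print(torch.__config__.show())

# Newer releases also expose the dispatched capability directly
# (guarded, since older releases lack this API).
if hasattr(torch.backends, "cpu") and hasattr(torch.backends.cpu, "get_cpu_capability"):
    print(torch.backends.cpu.get_cpu_capability())

# For A/B latency runs, ATen's vectorized dispatch can be capped with an
# environment variable set before the process starts, e.g.:
#   ATEN_CPU_CAPABILITY=avx2 python bench.py
```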
Notes: - By native PyTorch quantization, we mean what PyTorch supports out of the box, without exporting to ONNX Runtime (a minimal sketch of the flow we mean is at the end of this post).
- I saw a discussion here on Vec512 and AVX512, and couldn't quite tell
- A) whether this support has already been introduced, and
- B) what the actual performance difference is
Would be great to get some insight here.
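In case it helps to make the setup concrete, the flow we mean by "native PyTorch quantization" is post-training dynamic quantization along these lines. This is a minimal sketch with a toy stand-in, not our production model; on older releases, `torch.ao.quantization.quantize_dynamic` lives at `torch.quantization.quantize_dynamic`.

```python
import torch
import torch.nn as nn

# Select FBGEMM as the quantized-kernel backend (the default on x86 builds).
torch.backends.quantized.engine = "fbgemm"

# Toy stand-in for a transformer-class model (hypothetical sizes).
layer = nn.TransformerEncoderLayer(d_model=256, nhead=4).eval()

# Post-training dynamic quantization: nn.Linear modules are swapped for
# INT8 dynamically quantized equivalents; activations stay fp32 and are
# quantized on the fly inside each Linear.
qlayer = torch.ao.quantization.quantize_dynamic(
    layer, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(16, 1, 256)  # (seq_len, batch, d_model)
with torch.inference_mode():
    print(qlayer(x).shape)  # torch.Size([16, 1, 256])
```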